More detailed info and reading can be found in the repo subdirectories
http://wiki.omscs.org/confluence/display/CS7641ML/CS7641.FA14.+Final+exam+prep
Unsupervised learning (UL) consists of algorithms that are meant to "explore" the data on their own and provide the user with valuable information about their dataset/problem
-
Randomized Optimization
-
Clustering
- Single Linkage
- Consider each of the n points its own cluster
- Define the distance between two clusters as the distance between their closest pair of points (one point from each cluster)
- Merge the two closest clusters
- Repeat n - k times to end up with k clusters
- Method: (1) consider each point a cluster; (2) merge the two closest clusters; (3) unless we have k clusters, GOTO 2
- Links points closest to each other.
- Can result in "stringy" non-compact clusters
- K-Means
- Each iteration is polynomial
- Finitely many (at worst exponential) iterations in theory, but usually far fewer in practice
- Always converges, but can get stuck with "weird" clusters depending on the random starting state
- Method: (1) place k centers; (2) each center "claims" its closest points; (3) find the mean of each center's claimed points; (4) move each center to that mean; (5) unless converged, GOTO 2
- See the numpy sketch after this section
- Expectation Maximization
- Gaussian means: soft clustering where each point gets a probability of belonging to each (Gaussian) cluster
- Uses expectation and maximization steps
- Monotonically non-decreasing likelihood
- Not guaranteed to converge in theory (practically it does)
- Can get stuck
- Works with any distribution (not just Gaussian)
- Properties of Clustering Algorithms (Pick 2)
- Richness
- Scale Invariance
- Consistency
- Richness
- For any assignment of objects to clusters, there is some distance matrix, D, such that P_D returns that clustering
- Scale-Invariance
- Scaling distances by a positive value does not change the clustering
- Consistency
- Shrinking intra-cluster distances and expanding inter-cluster distances does not change the clustering.
- No clustering scheme can achieve all of Richness, Scale-Invariance, and Consistency (Kleinberg's impossibility theorem)
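The k-means loop above is compact enough to write out. A minimal numpy sketch; the blob data, k, and the iteration cap are just assumptions for illustration:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # (1) Place k centers (here: k random points from the data).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # (2) Each center claims its closest points.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3)/(4) Move each center to the mean of the points it claimed.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # (5) Unless converged, GOTO 2.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two made-up Gaussian blobs as demo data.
X = np.vstack([np.random.default_rng(1).normal(m, 0.5, size=(50, 2)) for m in (0, 5)])
labels, centers = k_means(X, k=2)
print(centers)
```

In practice scikit-learn's KMeans covers this, and GaussianMixture implements the EM-based soft clustering described above.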
-
Feature Selection
- Filtering
- Choose features independently of the learner, i.e. "filter" the data before it is passed to the learner
- Faster than wrapping (don't have to pay the cost of the learner)
- Tends to ignore relationships between features
- Decision Trees do this naturally (Filter on information gain)
- Wrapping
- "Wrap" the learner into the feature selection. Choose features based on how the learner performs.
- Takes into account learner bias
- Good at determining feature relationships (as they pertain to the success of the learner)
- Very slow (have to run the learner for every candidate feature subset)
- Speed Ups
- Randomized optimization
- Forward/backward sequential selection: greedily add (or remove) one feature at a time, keeping the change only if it improves the learner's performance (see the sketch after this section)
- Relevance
- Relevance vs. Usefulness
- Relevance measures the effect the variable has on the Bayes' Optimal Classifier
- Usefulness measures the effect the variable has on the error of a particular predictor (ANN, DT, etc.)
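To make the filter vs. wrapper distinction concrete, here is a hedged scikit-learn sketch; the synthetic dataset, the mutual-information filter, and the decision tree used as the wrapped learner are all arbitrary choices for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Filtering: score features independently of any learner (here: mutual information).
filt = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print("filter keeps features:", np.flatnonzero(filt.get_support()))

# Wrapping: forward sequential selection driven by the learner's cross-validated score.
learner = DecisionTreeClassifier(random_state=0)
selected, remaining = [], list(range(X.shape[1]))
while remaining:
    scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    # Stop when adding another feature no longer helps.
    if selected and scores[best] <= cross_val_score(learner, X[:, selected], y, cv=5).mean():
        break
    selected.append(best)
    remaining.remove(best)
print("wrapper keeps features:", selected)
```

scikit-learn's SequentialFeatureSelector implements the same forward/backward idea without the hand-rolled loop.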
-
Feature Transformation
- Polysemy: same word, different meanings (leads to false positives)
- Synonymy: different words, same meaning (leads to false negatives)
- PCA (Principal Component Analysis); see the sketch at the end of this section
- Example of an eigenproblem
- Finds direction (eigenvectors) of maximum variance
- All principal components (eigenvectors) are mutually orthogonal
- Reconstructing data from the principal components is proven to have the least possible L2 (squared) error compared to any other reduction
- Eigenvalues are monotonically non-increasing and are proportional to variance along each principal component (eigenvector). Eigenvalue of 0 implies zero variance which means the corresponding principal component is irrelevant
- Finds "globally" varying features (image brightness, saturation, etc.)
- Fast algorithms available
- ICA (Independent Component Analysis)
- Finds new features that are completely independent (from each other). i.e. they share no mutual information
- Attempts to maximize the mutual information between the original and transformed data. This allows original data to be reconstructed fairly easily from the transformed data.
- Blind Source Separation (Cocktail Party Problem)
- Finds "locally" varying features (image edges, facial features)
- RCA (Random Components Analysis), i.e. random projection
- Generates random directions
- It works surprisingly well if you use it to preprocess classification data
- Is able to capture correlations between data, but in order for this to be true, you must often reduce to a larger number of components than with PCA or ICA.
- Can't really reconstruct the original data well.
- Biggest advantage is speed.
- LDA (Linear Discriminant Analysis)
- Requires data labels
- Finds projections that discriminate based on the labels. i.e. separates data based on class.
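All four transformations have scikit-learn implementations, which makes a side-by-side sketch easy; the digits dataset and the choice of 2 components are arbitrary assumptions for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, FastICA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

# PCA: orthogonal directions of maximum variance; eigenvalues ~ variance explained.
pca = PCA(n_components=2).fit(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# ICA: statistically independent components (blind source separation).
ica = FastICA(n_components=2, random_state=0).fit(X)

# RCA / random projection: random directions; cheap, but usually needs more components.
rca = GaussianRandomProjection(n_components=2, random_state=0).fit(X)

# LDA: supervised; projects onto directions that best separate the class labels.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

for name, model in [("PCA", pca), ("ICA", ica), ("RP", rca), ("LDA", lda)]:
    print(name, model.transform(X).shape)
```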
-
Information Theory
- Entropy: A characterization of uncertainty about a source of information
- Joint Entropy: The entropy contained by the combination of two variables
- Conditional Entropy: The entropy of one variable, given another
- Mutual Information: The reduction of entropy of a variable, given knowledge of another variable
- KL Divergence: A non-symmetric measure of the difference between two probability distributions P and Q
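For discrete variables these quantities reduce to sums over a joint distribution. A small numpy sketch, using a made-up 2x2 joint distribution:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p * log2(p), ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution P(X, Y); rows index X, columns index Y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

H_X, H_Y = entropy(px), entropy(py)
H_XY = entropy(joint.ravel())        # joint entropy H(X, Y)
H_Y_given_X = H_XY - H_X             # conditional entropy H(Y|X)
I_XY = H_Y - H_Y_given_X             # mutual information I(X;Y)

# KL divergence D(P || Q) for made-up distributions P and Q over two outcomes.
p, q = np.array([0.8, 0.2]), np.array([0.5, 0.5])
D_kl = np.sum(p * np.log2(p / q))

print(H_X, H_Y, H_XY, H_Y_given_X, I_XY, D_kl)
```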
Reinforcement Learning: A Survey
Put an agent into a world (make sure you can describe it with an MDP!), give it some rewards and penalties, and hopefully it will learn.
-
Markov Decision Processes
- Building a MDP
- States
- MDP should contain all states that an agent could be in.
- Actions
- All actions an agent can perform. Sometimes this is a function of state, but more often it is a list of actions that could be performed in any state
- Transitions (model)
- Probability that the agent will arrive in a new state, given that it takes a certain action in its current state: P(s'|s, a)
- Rewards
- Easiest to think about as a function of state (i.e. when the agent is in a state it receives a reward). However, it is often a function of a [s, a] tuple or a [s, a, s'] tuple.
- Policy
- A mapping from each state to the action the agent should take in that state.
- The optimal policy is the policy that maximizes the agent's long term expected reward.
- Utility: the long-term expected (discounted) reward of being in a state, as opposed to the immediate reward received there
- Value Iteration
- "Solve" (iteratively until convergence, more like hill climb) Bellman Equation.
- When we have maximum utility, the policy which yields that utility can be found in a straightforward manner.
- Policy Iteration
- Start with random (or not) initial policy.
- Evaluate the utility of that policy.
- Update policy (in a hill climbing-ish way) to the neighboring policy that maximizes the expected utility.
- Discount factor, γ (typically between 0 and 1), describes the value placed on future reward. The higher γ is, the more emphasis is placed on future reward.
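A minimal value iteration sketch over a tabular MDP; the transition tensor T[s, a, s'], the state-reward vector R[s], and the tiny two-state MDP at the bottom are hypothetical placeholders, not anything specific from the lectures:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """T: (S, A, S) transition probabilities, R: (S,) reward for each state."""
    S, A, _ = T.shape
    U = np.zeros(S)
    while True:
        # Bellman update: U(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')
        Q = R[:, None] + gamma * (T @ U)      # shape (S, A)
        U_new = Q.max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            break
        U = U_new
    policy = Q.argmax(axis=1)                 # greedy policy from the converged utilities
    return U, policy

# Tiny hypothetical 2-state, 2-action MDP.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
print(value_iteration(T, R))
```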
-
Model-Based vs. Model-Free
- Model-Based requires knowledge of transition probabilities and rewards
- Policy Iteration
- Value Iteration
- Model-Free gets thrown into the world and learns on its own from experienced "[s, a, s', r]" tuples, without being handed the transition/reward model.
- Q Learning
-
Three types of RL
- Policy search: learn the policy directly (direct use, indirect learning)
- Value function based: learn U/Q, then take an argmax to extract the policy
- Model based: learn T and R, then solve the Bellman equations for the policy (indirect use, direct learning)
-
Q Learning
- The Q function is a modification of the Bellman equation, written in terms of state-action values Q(s, a)
- Learning rate, α, is how far we move the Q value toward its new estimate on each update.
- If each action is executed in each state an infinite number of times on an infinite run and α is decayed appropriately, the Q values will converge with probability 1 to Q*
- Exploration vs Exploitation
- Epsilon Greedy Exploration
- Take a random action with some (decaying) probability, otherwise act greedily; similar in spirit to Simulated Annealing
- Optimistic starting values for the Q function can also serve as a sort of exploration (see the Q-learning sketch after this section)
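A tabular Q-learning sketch with epsilon-greedy exploration; the environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) and all hyperparameters are assumptions for illustration, not a specific library API:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, eps=1.0, eps_decay=0.999):
    """Tabular Q-learning with epsilon-greedy exploration (hypothetical env interface)."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability eps, otherwise exploit.
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
        eps *= eps_decay   # decay exploration over time
    return Q
```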
-
Game Theory
- Zero Sum Games
- A mathematical representation of a situation in which each participant's gain (or loss) of utility is exactly balanced by the losses (or gains) of the utility of the other participant(s).
- Perfect Information Game
- All agents know the states of other agents
- minimax == maximin
- Hidden Information Game
- Some information regarding the state of a given agent is not known by the other agent(s)
- minimax != maximin
- Pure Strategies: the player deterministically plays a single strategy
- Mixed Strategies: the player plays a probability distribution over strategies
- Nash Equilibrium
- No player has anything to gain by changing only their own strategy.
- Repeated Game Strategies (a small sketch of the strategies below appears at the end of this section)
- Finding the best response against a repeated-game finite-state strategy is the same as solving an MDP
- Tit-for-tat
- Start with cooperation for first game, copy opponent's strategy (from the previous game) every game thereafter.
- Grim Trigger
- Cooperates until opponent defects, then defects forever
- Pavlov
- Cooperate if opponent agreed with your move, defect otherwise
- Only strategy shown that is subgame perfect
- Folk Theorem: Any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor.
- In repeated games, the possibility of retaliation opens the door for cooperation.
- Feasible Region
- The region of possible average payoffs for some joint strategy
- MinMax Profile
- A pair of payoffs (one for each player), that represent the payoffs that can be achieved by a player defending itself from a malicious adversary.
- Subgame Perfect
- Always best response independent of history
- Plausible Threats: a threat is only plausible (credible) if the threatening player would actually be willing to carry it out; subgame perfect strategies rely only on plausible threats
- Zero Sum Stochastic Games
- Value Iteration works!
- Minimax-Q converges
- Unique solution to Q*
- Policies can be computed independently
- Update efficient
- Q functions sufficient to specify policy
- General Sum Stochastic Games
- Value Iteration doesn't work
- Minimax-Q doesn't converge
- No unique solution to Q*
- Policies cannot be computed independently
- Update not efficient
- Q functions not sufficient to specify policy
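The repeated-game strategies above (tit-for-tat, grim trigger, Pavlov) are easy to encode directly; this sketch assumes moves are the strings "C" (cooperate) and "D" (defect) and that each strategy sees both full move histories (all function names are made up for illustration):

```python
# Each strategy maps (my_history, opp_history) -> next move, "C" or "D".

def tit_for_tat(my_history, opp_history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not opp_history else opp_history[-1]

def grim_trigger(my_history, opp_history):
    # Cooperate until the opponent defects once, then defect forever.
    return "D" if "D" in opp_history else "C"

def pavlov(my_history, opp_history):
    # Cooperate if we agreed (played the same move) last round, otherwise defect.
    if not my_history:
        return "C"
    return "C" if my_history[-1] == opp_history[-1] else "D"

def play(strategy_a, strategy_b, rounds=10):
    """Pit two strategies against each other and return both move histories."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return hist_a, hist_b

print(play(tit_for_tat, grim_trigger))
```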