Improves the Performance Element based on feedback from the Critic
Two goals for improving the Performance Element
Make it find a better solution
Make it find a solution faster
Critic
Provides feedback, based on some outside standard, to the Learning Element
Problem Generator
Provides opportunities to learn
Suggests alternative actions that have unknown outcomes
Learning Functions
Hypothesis = approximation of the function to be learnt
Larger hypothesis space: more likely to contain the perfect function, but harder to find
Slower convergence
More examples needed
Decision Tree Induction
Algorithm (recursive; see the sketch after this list)
Does the whole group have the same output value? If so, place a leaf node with that value in the output tree
Select the most discriminating attribute
Break the examples into groups depending on the value of that attribute
Place a branch in the output tree
Recursively process the groups (removing the dividing attribute)
Determining the most discriminating attribute
Causes the most examples with the same output value to be in the same group
Causes groups of even size
I don't think this is good
Minimises total entropy
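A minimal Python sketch of the recursive induction algorithm above, using the minimise-total-entropy criterion to pick the most discriminating attribute. The representation (examples as dicts with an "output" key) is an assumption for illustration, not from the notes.

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy of the output values of a group of examples."""
    counts = Counter(ex["output"] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def most_discriminating(examples, attributes):
    """Pick the attribute whose split minimises total (weighted) entropy."""
    def split_entropy(attr):
        groups = {}
        for ex in examples:
            groups.setdefault(ex[attr], []).append(ex)
        return sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def induce_tree(examples, attributes):
    outputs = {ex["output"] for ex in examples}
    if len(outputs) == 1:                      # whole group has the same output value
        return outputs.pop()                   # -> leaf node
    if not attributes:                         # no attributes left: majority vote
        return Counter(ex["output"] for ex in examples).most_common(1)[0][0]
    attr = most_discriminating(examples, attributes)
    tree = {"attribute": attr, "branches": {}}
    groups = {}
    for ex in examples:                        # split on the chosen attribute
        groups.setdefault(ex[attr], []).append(ex)
    remaining = [a for a in attributes if a != attr]
    for value, group in groups.items():        # recurse, dropping the dividing attribute
        tree["branches"][value] = induce_tree(group, remaining)
    return tree
```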
Explicit vs Implicit Representation
Explicit
E.g. a lookup table
Implicit
E.g. a linear weighted sum of features
Advantages
Generalisation
Can return an estimate of the value for states not yet seen
Space
No need to store a large data structure
5. Sequential Decision Problems
Utility depends on the sequence of decisions
Utilities only known in terminal states
Uncertain Domains
Stochastic actions
Nondeterministic
Transition Model
Probability of being in a particular state after taking an action
M(i,a,j)=P(j|i,a)
Markov Property: the transition probability depends only on the current state (not on past history)
Markov Decision Problem: calculating the optimal policy in an environment that is:
Accessible
You can tell what state you are in
Stochastic
with a known transition model
Step Cost
normally <0
Smaller (absolute) value => more conservative policy
Larger (absolute) value => more direct (riskier) policy
Reward(i) = StepCost(i) + Utility(i) (the utility term applies only if i is terminal)
Solving For Utility Values
Bellman's Equation
U(i) = R(i) + Max(over all actions a) { Sum(over all states j) {M(i,a,j)*U(j)} }
Nonlinear (contains maxes)
Value Iteration (see the sketch below)
U'(i) = R(i) + Max(over all actions a) { Sum(over all states j) {M(i,a,j)*U(j)} }
Repeat until U' is close enough to U
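A minimal sketch of value iteration as described above, assuming the transition model is a dict M[(i, a, j)] = P(j | i, a) and R maps each state to its reward; terminal states can simply have no outgoing entries in M. The names and the epsilon threshold are illustrative assumptions.

```python
def value_iteration(states, actions, M, R, epsilon=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for i in states:
            # U'(i) = R(i) + max over actions a of sum over j of M(i,a,j) * U(j)
            U_new[i] = R[i] + max(
                sum(M.get((i, a, j), 0.0) * U[j] for j in states)
                for a in actions
            )
        delta = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        if delta < epsilon:          # repeat until U' is close enough to U
            return U
```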
Policy Iteration (see the sketch below)
To get the utilities from a policy P
Solve U(i) = R(i) + Sum(over all states j) {M(i,P(i),j)*U(j)}
Or use a fixed number of steps of value iteration:
U'(i) = R(i) + Sum(over all states j) {M(i,P(i),j)*U(j)}. Repeat this a fixed number of times, without updating the policy, to determine the utilities
Called "Modified Policy Iteration"
For a large number of states, this is faster than the original policy iteration
Check for a better action for each state
If there is some action a, a != P(i), that results in a higher utility value for U(i) than taking P(i), update P(i) to take that action
If none is found, then the policy has converged and is optimal
Else recalculate the utilities for the new policy
Policy iteration is often faster than value iteration, because the policy can become optimal before the utility values have fully converged; the utilities do not need to be exact to end up with the same (optimal) policy
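A minimal sketch of (modified) policy iteration under the same M and R assumptions as the value-iteration sketch: k sweeps of value determination for the current policy, then a check for a better action in each state.

```python
def policy_iteration(states, actions, M, R, k=20):
    policy = {s: actions[0] for s in states}      # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # value determination: k sweeps of U'(i) = R(i) + sum_j M(i,P(i),j) * U(j)
        for _ in range(k):
            U = {i: R[i] + sum(M.get((i, policy[i], j), 0.0) * U[j] for j in states)
                 for i in states}
        changed = False
        for i in states:                          # check for a better action
            def q(a):
                return sum(M.get((i, a, j), 0.0) * U[j] for j in states)
            best = max(actions, key=q)
            if q(best) > q(policy[i]):
                policy[i] = best
                changed = True
        if not changed:                           # policy converged: optimal
            return policy, U
```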
Environment contains no terminal states (the agent lives forever)
Additive utilities become infinite, as paths become infinitely long
Value iteration never terminates
Discounting is an alternative to additive utility
The utility of a sequence of states: U(s0...sn) = Sum(over n) {R(sn)*y^n}, for some constant 0 <= y < 1
This has a finite value even for infinitely long chains of states
This factor y can be added to our Bellman-like equations (in direct solution, policy iteration, and value iteration): U(i) = R(i) + y*Max(over all actions a) { Sum(over all states j) {M(i,a,j)*U(j)} }
6. Reinforcement Learning
Issues:
If the reward comes only at the end of the game, it is hard to know which moves were the good ones unless a lot of games are played
Advantages
Can learn the transition model
Active or Passive
Active: must choose its own moves
Optimistic Prior
To start explorative and become exploitative
Initially assume everything is good
U+(i) <- R(i) + Max(over all actions a) { f( Sum(over all states j) {M(i,a,j)*U+(j)}, N(a,i) ) }
f(u,n) is the exploration function, where u is a utility estimate and n is the number of times this (state, action) pair has been tried before
Example:
f(u,n) = { 1000, if n < 10; u, otherwise }
This means the agent will try each (state, action) pair at least 10 times
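A minimal sketch of the exploration function in the example above; R_PLUS stands in for the optimistic "1000" value and N_E for the 10-try threshold (both names are assumptions).

```python
R_PLUS = 1000   # optimistic estimate: assume untried actions are very good
N_E = 10        # try each (state, action) pair at least this many times

def exploration_fn(u, n):
    """u: current utility estimate; n: times this (state, action) pair was tried."""
    return R_PLUS if n < N_E else u
```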
Can turn a passive method into an active one
By learning the model
Using the performance element to select a move based on the known utilities, taking advantage of an exploration function
Passive: moves are provided to the agent (learning element)
Utility Learning
Learns the utility function
LMS
Least Mean Square
Algorithm: at the end of some chain of states, move backwards through the chain; at each state ei:
Reward-to-Go += Reward(ei)
U(ei) = RunningAverage(U(ei), Reward-to-Go)
Ignores the model. State utilities are interdependent; this does not take advantage of that information
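A minimal sketch of the LMS update above, run once at the end of each training sequence. U, counts and reward are assumed to be dicts keyed by state.

```python
def lms_update(sequence, reward, U, counts):
    """Walk backwards through the visited states, averaging reward-to-go into U."""
    reward_to_go = 0.0
    for state in reversed(sequence):
        reward_to_go += reward[state]
        counts[state] = counts.get(state, 0) + 1
        # running average of the observed reward-to-go for this state
        U[state] = U.get(state, 0.0) + (reward_to_go - U.get(state, 0.0)) / counts[state]
    return U
```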
ADP
Adaptive Dynamic Programming
Solve U(i) = R(i) + Sum(over all j){M(i,j)*U(j)}. M(i,j) is given to us as the model; once enough sample sequences have been seen, we can solve this for U(i) (and R(i))
Intractable in large spaces
Useful theoretical benchmark
Similar to value determination in SDP
TD
Temporal Difference Learning
Algorithm
For terminal states
Utility = running average of the reward at that state
For non-terminal states:
If we encountered state j immediately after state i,
Then U(i) <- U(i) + y*(R(i) + U(j) - U(i))
Improved version: replace y with a function that decreases the more often state i has been seen
y is a constant: 0 <= y <= 1
Makes use of the relationships between states implicitly
Rather than explicitly (through the transition matrix M(i,j))
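A minimal sketch of the TD update above; alpha plays the role of the constant y (and could be replaced by a function that decays with the visit count of i, as the improved version suggests).

```python
def td_update(U, R, i, j, alpha=0.1):
    """After observing a transition i -> j, nudge U(i) toward R(i) + U(j)."""
    u_i = U.get(i, 0.0)
    U[i] = u_i + alpha * (R[i] + U.get(j, 0.0) - u_i)
```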
Active ADP
If the agent gets to choose its own actions (active)
As states are encountered:
Update the action model
Use value iteration to find the utilities (making use of the updated action model)
Get the action from the performance element
Choosing the best move using: argmax(over all actions b) Sum(over all states j) {M(i,b,j)*U(j)}
Making sure to keep track of the previous state and the action taken, so that the action model can be updated at the next step
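A minimal sketch of one step of an active ADP agent, reusing the value_iteration sketch from earlier. The agent dictionary layout (counts, outcomes, M, R, U, prev_state, prev_action) is an assumption, and R is assumed to be pre-initialised to 0.0 for every state.

```python
def active_adp_step(agent, current_state, reward):
    # update the action model from the previous (state, action, outcome) triple
    if agent["prev_state"] is not None:
        i, a = agent["prev_state"], agent["prev_action"]
        agent["counts"][(i, a)] = agent["counts"].get((i, a), 0) + 1
        key = (i, a, current_state)
        agent["outcomes"][key] = agent["outcomes"].get(key, 0) + 1
        for (ii, aa, jj), n in agent["outcomes"].items():
            agent["M"][(ii, aa, jj)] = n / agent["counts"][(ii, aa)]
    agent["R"][current_state] = reward
    # re-solve for utilities with the updated model
    agent["U"] = value_iteration(agent["states"], agent["actions"],
                                 agent["M"], agent["R"])
    # performance element: choose the action with the best expected utility
    def expected_u(a):
        return sum(agent["M"].get((current_state, a, j), 0.0) * agent["U"][j]
                   for j in agent["states"])
    action = max(agent["actions"], key=expected_u)
    # remember state and action so the model can be updated next step
    agent["prev_state"], agent["prev_action"] = current_state, action
    return action
```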
Model Learning
In passive learning (the agent does not choose actions)
M(i,j) = % of the time that being in state i leads to a transition to state j
Active learning
M(i,a,j) = % of the time that taking action a in state i leads to state j
A model is needed
to go from a utility table to a policy table
For ADP
Not for TD, as an update toward an unexpected successor state only happens in proportion to the chance of reaching that state
This applies whether active or passive
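A minimal sketch of model learning by counting transitions, for the active case M(i, a, j); the passive case is identical with the action dropped from the keys.

```python
from collections import defaultdict

class TransitionModel:
    def __init__(self):
        self.tried = defaultdict(int)   # times (i, a) has been tried
        self.seen = defaultdict(int)    # times (i, a) has led to j

    def record(self, i, a, j):
        self.tried[(i, a)] += 1
        self.seen[(i, a, j)] += 1

    def prob(self, i, a, j):
        """Estimated M(i, a, j): fraction of the times (i, a) was tried that led to j."""
        tried = self.tried[(i, a)]
        return self.seen[(i, a, j)] / tried if tried else 0.0
```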
Q-Learning
Learns an action-value function: the utility of taking an action in a state
Learns: Q(i,a)
Could convert to utilities by: U(i) = Max(over all actions a){Q(a,i)}
No model required
Does not consider which state an action leads to
Must know the actions taken at each state (good to implement as active)
TD
For j = current state, i = previous state, a = last action taken, y = learning rate: Q(a,i) <- Q(a,i) + y*(R(i) + Max(over all actions b){Q(b,j)} - Q(a,i))
Performance element:
argmax(over all actions b) f(Q(b,j), N(b,j))
Compared to utility learning
Slower to converge; greater policy loss
Consistency between values is not enforced by a model
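A minimal sketch of the Q-learning update and performance element above. As in the notes, alpha stands in for the learning rate y, and no discount factor is applied; Q and N are assumed to be dicts keyed by (action, state).

```python
def q_update(Q, R, i, a, j, actions, alpha=0.1):
    """i: previous state, a: action taken there, j: current state."""
    best_next = max(Q.get((b, j), 0.0) for b in actions)
    q_old = Q.get((a, i), 0.0)
    Q[(a, i)] = q_old + alpha * (R[i] + best_next - q_old)

def choose_action(Q, N, j, actions, f):
    """Performance element: argmax over actions b of f(Q(b, j), N(b, j))."""
    return max(actions, key=lambda b: f(Q.get((b, j), 0.0), N.get((b, j), 0)))
```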
7. Planning
Algorithm:
Represent the start conditions as postconditions of a Start action, and the goal conditions as preconditions of an End action. Order Start before End
Are there any open preconditions? If not, return the plan
Choose an open precondition C of some step S
Add a new action A, or reuse an existing one, that has C as a postcondition, ordering A before S
Detect and Deal with Clobbering
Clobbering
Defn: A new action C clobbers a causal link between existing actions A and B (where a postcondition of A meets a precondition of B) if, were the actions interleaved as A, C, B, C would cause that postcondition to no longer hold when B is reached
Resolution
Demotion: Order C before A
Promotion: Order C after B
Runtime Failure
Contingency Planning
Plan to obtain information, with observation actions
Expensive, because many contingency plans must be made for unlikely conditions
Hard to know all possible failures in the real world
If an open condition can be established by an observation action, add that observation action to the plan and make a subplan for that condition
Monitoring
Two types of monitoring
Execution Monitoring
Check whether any future preconditions that should have been met are not
Use the causal links from the partial-order planner
Action Monitoring
Check whether the preconditions of the next action are met
Check whether the action itself fails when executed
Replanning
Replan from Scratch
Plan to get back on track
Much more efficient for long plans
Situated Planning
Combines execution monitoring and planning into one process
Many Activities
Execute Plan Step
Monitor world
Fix problems in plan
Clobbering
Open Conditions
Refine plan based on new information
World State changed
Possibly due to actions of other agents
Action Failure
8. Logical Agents
Sentences
Satisfiable
A sentence is satisfiable if there exists at least one model in which it is true
A model m satisfies a sentence s if s is true in m
A model m satisfies a KB of sentences if all sentences in the KB are true in m
i.e. if m satisfies all sentences in the KB
Valid
A sentence is valid if it is true in all models
Examples
True
Av~A
A=>A
A^(A=>B)=>B
Not related to being syntactically valid
Unsatisfiable
A sentence is unsatisfiable if it is true in no models
Examples
False
A^~A
A<=>~A
Entailment
A knowledge base KB entails a sentence S if the models satisfying KB are a subset of the models satisfying S
Models can satisfy S without satisfying KB, but not the reverse
Logically Equivalent
Two sentences are logically equivalent iff they entail each other
i.e. the set of models satisfying one is the same as the set of models satisfying the other
Soundness
A system is sound if it can only generate logical consequences
KB|-A => KB|=A
i.e. all things inferred are entailed
Completeness
A system is complete if it can generate all logical consequences
i.e. KB|=A => KB|-A
i.e. all things entailed can be inferred
Inference Systems
Chaining
Horn Form
Premise=>Conclusion
Premise = a conjunction of propositions
Conclusion = a proposition
Or a stand-alone proposition (a fact)
Forward Chaining
Algorithm
Find implication rules that have all their premises in the KB,
but not their conclusions
Add the conclusion to the KB
Repeat until the RTP is proven
Slower than backwards chaining, as it proves a lot of other things along the way
Backwards chaining is more directed
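A minimal sketch of propositional forward chaining over Horn clauses. Rules are assumed to be (premises, conclusion) pairs, facts a set of known propositions, and query the RTP.

```python
def forward_chain(rules, facts, query):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # fire any rule whose premises are all known but whose conclusion is not
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
                if conclusion == query:       # stop as soon as the RTP is proven
                    return True
    return query in facts

# e.g. forward_chain([(("A", "B"), "C"), (("C",), "D")], {"A", "B"}, "D") -> True
```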
Backwards chaining
Algorithm
Check whether the RTP is in the KB. If so, return true; if not:
Find all implications that have the RTP as their conclusion
If, for any of those implications, you can prove all the premises (with backwards chaining), then the RTP is true
If not, then the RTP is false
Not complete unless you avoid enqueueing goals to be proven that are already pending (otherwise it can loop forever)
Can be made faster by caching truths and falsehoods
Resolution
Conjunctive normal form
The knowledge base as a whole is a conjunction of disjunctions
A conjunction of clauses
An AND of ORs
Clauses are disjunctions (ORs) of literals
Literals are positive or negated proposition symbols
Resolution inference
OR together two clauses containing complementary literals; the result is the OR of the remaining literals, with the complementary pair removed
Satisfiability Problem
Useful for finding out whether there are models (and returning one) for a KB
Especially if you add to the KB something you want to know
Or its negation (for proof by contradiction)
NP-Complete
DPLL (a variation of Davis-Putnam; see the sketch after these steps)
Depth-first search to find a model that satisfies a sentence
Terminate with false if any clause in the sentence is false
i.e. if any clause has all its literals false
Terminate with true if all clauses are true
Remove pure symbols
Symbols that don't have their complement in any clause
They can't affect the result
Add them, with their required value, to the model
Recursively try to find a model for the remaining sentence
Handle unit clauses
A clause that is one literal long
Since the sentence is only true if all clauses are true,
this clause determines the value of its symbol
Add that value to the model
Recursively try to find a model for the remaining sentence
Otherwise, choose a symbol
Consider (with DPLL) the cases of that symbol being true and of it being false
Return the OR of these
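A minimal DPLL sketch following the steps above. The representation is an assumption: a sentence is a list of clauses, each clause a set of (symbol, polarity) literals, so ("A", False) stands for ~A; symbols lists every proposition symbol in the sentence. The function returns a satisfying model or None.

```python
def dpll(clauses, symbols, model):
    unknown = []
    for clause in clauses:
        values = [model.get(s) if pol else
                  (None if model.get(s) is None else not model[s])
                  for s, pol in clause]
        if any(v is True for v in values):
            continue                         # clause already true
        if all(v is False for v in values):
            return None                      # a clause is false: fail this branch
        unknown.append(clause)
    if not unknown:
        return model                         # all clauses true: model found
    # pure symbols: appear with only one polarity in the remaining clauses
    polarities = {}
    for clause in unknown:
        for s, pol in clause:
            if s not in model:
                polarities.setdefault(s, set()).add(pol)
    for s, pols in polarities.items():
        if len(pols) == 1:
            return dpll(clauses, symbols, {**model, s: pols.pop()})
    # unit clauses: only one unassigned literal left, so its value is forced
    for clause in unknown:
        unassigned = [(s, pol) for s, pol in clause if s not in model]
        if len(unassigned) == 1:
            s, pol = unassigned[0]
            return dpll(clauses, symbols, {**model, s: pol})
    # otherwise branch on some unassigned symbol and OR the two cases
    s = next(sym for sym in symbols if sym not in model)
    return (dpll(clauses, symbols, {**model, s: True}) or
            dpll(clauses, symbols, {**model, s: False}))

# e.g. dpll([{("A", True), ("B", True)}, {("A", False)}], ["A", "B"], {})
#      -> {'B': True, 'A': False}
```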
WalkSAT
Performs a semi-random walk to locally search for a model that satisfies the sentence
Parameters
Max flips: the maximum number of iterations to try
P: normally about 0.5
Initially the model is a random assignment of true/false to all the symbols in the sentence
Algorithm: repeatedly
If the model satisfies the sentence, then done; else
Randomly pick a false clause C
With probability P: flip a random symbol from C in the model
Otherwise (probability 1-P): flip the symbol in C that maximises the overall number of clauses that are true
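A minimal WalkSAT sketch using the same clause representation as the DPLL sketch above; p and max_flips are the two parameters from the notes.

```python
import random

def walksat(clauses, symbols, p=0.5, max_flips=10000):
    # initially the model is a random true/false assignment to every symbol
    model = {s: random.choice([True, False]) for s in symbols}
    def satisfied(clause):
        return any(model[s] == pol for s, pol in clause)
    for _ in range(max_flips):
        false_clauses = [c for c in clauses if not satisfied(c)]
        if not false_clauses:               # model satisfies the sentence: done
            return model
        clause = random.choice(false_clauses)
        if random.random() < p:
            # flip a random symbol from the chosen false clause
            sym = random.choice([s for s, _ in clause])
        else:
            # flip the symbol in C that maximises the number of satisfied clauses
            def score(s):
                model[s] = not model[s]
                n = sum(1 for c in clauses if satisfied(c))
                model[s] = not model[s]
                return n
            sym = max({s for s, _ in clause}, key=score)
        model[sym] = not model[sym]
    return None                             # no model found within max_flips
```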
9. First Order Logic
Situation Calculus
Can formulate planning as inference on a KB
Situation Function: Result(a,s)
a is an action
s is the situation
prior to a being taken
Effect Axiom
Pre(s) -> Post(Result(action,s))
Post (and Pre) are both predicates on the situation
Problems
Ramification Problem
Can't know all postconditions
Probabilistic postconditions (only happen some of the time)
Qualification Problem
Can't know all preconditions
Frame Problem
How to deal with things that don't change
Inference: default things to not changing
Yale Shooting Problem (not examinable)
Representation
Successor State Axiom
P is true in the resulting state iff
the action just taken made it true, or
P was true in the previous state and that action did not make it untrue
P(result(a,s)) <=> MadeTrueP(a,s) v (P(s) ^ DidNotUndoP(a,s))
General
Unification
For a set of substitutions S
Unify(a,b)=S if aS=bS
Standardising Apart
Renames variables that have the same name but a different meaning
Gives new variable names, irrespective of scope
Skolemisation
Method for removing existentially quantified variables
Replace the variable with a function of the enclosing universally quantified variables
Called a Skolem function
Required for resolution
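A minimal unification sketch for Unify(a,b) = S with aS = bS. The term representation is an assumption: variables are strings starting with "?", constants are other strings, and compound terms are tuples such as ("Knows", "?x", "John"); the occurs check is omitted.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def unify(a, b, subst=None):
    """Return a substitution dict S such that aS = bS, or None if none exists."""
    if subst is None:
        subst = {}
    if a == b:
        return subst
    if is_var(a):
        return unify_var(a, b, subst)
    if is_var(b):
        return unify_var(b, a, subst)
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):            # unify argument lists element by element
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None                           # mismatched constants or structure

def unify_var(var, term, subst):
    if var in subst:
        return unify(subst[var], term, subst)
    if is_var(term) and term in subst:
        return unify(var, subst[term], subst)
    return {**subst, var: term}           # bind the variable (no occurs check)

# e.g. unify(("Knows", "?x", "John"), ("Knows", "Mary", "?y"))
#      -> {'?x': 'Mary', '?y': 'John'}
```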
Inference Systems
Chaining
Generalised Modus Ponens (GMP)
For KBs
of Definite clauses
exactly 1 positive literal per clause
All variables universally quantified
p1', p2', ..., pn', (p1 ^ p2 ^ ... ^ pn => q) |- qS
where pi'S = piS for all i
So pi' and pi are the same predicate, made identical by the substitution S