Skip to content

Class Structure

lymacasm edited this page Aug 10, 2017 · 10 revisions
Table of Contents

Overview
Cards and Hands
    CardProperty
    Card
    Hand
        Labelled Class v.s. Assigned Class
Genetic Algorithm/Programming Classes
    Condition
        Creating a New Condition
            function
            __str__
            is_string_condition
            gen_rand_params
            mutate_params
            _parameter_equality
    Gene
        Confidence
        Mutation Resistance
    Chromosome
        Default Class
        Fitness
    Population
        Population Initialization
        Crossover
            Random Selection
            Roulette Selection
            Rank Selection
        Mutations
            Chromosome Insertion
            Chromosome Deletion
            Chromosome Default Class Mutation
            Gene Insertion
            Gene Deletion
            Gene Recombination
            Gene Class Mutation
            Gene Confidence Mutation
        Fitness Computation
        One v.s. All
        Parameters

Overview

Inheritance and classes are heavily used to maintain a useful structure for the code. This doc tries to explain how all the classes fit together. I've structured the python code so that mostly all the classes are in separate files (except for CardProperty and Card, which are both in card.py).

Cards and Hands

CardProperty

Each card has two properties: the suit and the value. A property contains both its numeric representation and its string representation. The numeric representation is encoded as described here.

Card

As mentioned above, each card has two properties: its suit and its value. A Card object can either initialize its CardProperty's using the numeric property representations, or the string property representations. A ValueError exception will be thrown if either the number for numeric representation doesn't belong to any possible value, or the string for string representation doesn't belong to any of the possible strings. Code using this property should be tested to ensure that no ValueErrors are being thrown.

Hand

A hand is, essentially, a list of cards. In the case of the current dataset, a hand is a list of 5 cards. A hand is initialized by passing a row from the dataset in pandas format to the class.

Labelled Class v.s. Assigned Class

If the dataset is labelled, the hand will initialize its labelled_class property with the label. Once initialize, the labelled_class property should never be changed. The assigned_class property, however, is the class that the classifier has assigned it too. The fitness property will check whether the assigned_class and the labelled_class are equal. The assigned_class property can be changed as needed.

Genetic Algorithm/Programming Classes

Condition

A condition is the basic building block for building our rules. It can check whether two card values are equal or unequal, if suits are equal or unequal, etc. Each specific condition inherits from the Condition base class. The Condition base class has two properties: function, and parameters. Function is an instance of a function to be called to evaluate the condition. The function should return either True or False. Parameters are a tuple of values. The amount of parameters can vary. The function must be called with a Hand as the first input, and the parameters as the next inputs.

Creating a New Condition

There are various things that need to be implemented when creating a new condition. The new condition must be made in the condition.py file, and must inherit from Condition. The new condition must be added to the list of condition classes that get returned from the get_conditions function in the Condition class. The function containing the actual condition must be implemented, as well as a way to generate the proper parameters randomly. The class must have a __str__ method implemented that will return a string of the python code that executes your function (all in one line). Conversely, a function must be overridden that will return True or False as to whether a string was generated from the condition. A method must be implemented for mutating the parameters. You must determine whether you need to override the default _parameter_equality method to check for condition equality.

function(hand, *parameters)

You should make your function a static method in your class. That means it doesn't need self as an input. The name of the function should be the same as the name of your class, except the class should use pascal case (MyCoolClass), and the function should use underscores (my_cool_class). The function must return either True or False, and should be able to be executed in one line (for code generation, it will be a single condition in an if statement).

__str__(self)

This method must be implemented for code generation to work properly. It should basically return your function in string format. Make sure to put the values of your parameters in the string before returning it! (If this is implemented, print Class will call this function to convert your class into a string).

is_string_condition(string)

This method must be implemented if reading conditions from a file is to work correctly. It must be able to recognize if a string was generated using its __str__ method (so basically check if input string looks like the __str__ output). If this is not implemented, when loading conditions from a file, a NotImplemented exception will be thrown.

gen_rand_params(self, seed)

This function must be implemented in any new class, containing both an instance of itself, and a seed. The seed is used to seed the random number generator. In this function, you control how many parameters your class has, as well as the values these parameters can take on. This function must randomly generate parameters, and return them as a tuple. The parameters must be returned in the order they will appear in the function.

mutate_params(self, mutation_rate, seed)

Since parameters will be different for every type of condition, they will also be mutated differently. Here, you'll need to implement a method to mutate the parameters. The mutation_rate will be between 0 and 1, and each parameter will need a random check to see whether it will be mutated. 0->Low probability of mutation, 1->High probability of mutation.

_parameter_equality(self, other)

The self parameter is yourself, the other parameter is another version of yourself. This is used to compare whether two conditions are equal (self == other). This is important to reduce redundancy (remove equal conditions) and improve computation time (less conditions to check if they're all unique). When this function has been called, you assume that self and other are both the same condition. The default function checks if the parameters tuple of self is equal to the parameters tuple of other. In most cases, the default is good, and this function does not need to be overriden from Condition. However, in some cases, it does. Example: val[idx1] == val[idx2], self.parameters = (idx1, idx2). In this case, val[idx2] == val[idx1] is the same thing as val[idx1] == val[idx2]. That means that the check has to be more complicated, because the order of the parameters does not matter. In these cases, where the parameter order doesn't matter, the function must be overriden and implemented in the derived class.

Gene

The goal of a gene is to describe under what conditions a hand belongs to a certain class. In essence, a gene is a list of conditions. Each gene is associated with a single class (poker hand), and to determine whether a hand belongs to its class, it will pass the hand to each condition, and if they all are evaluated to True, then the gene will say that the hand belongs to that class.

Confidence

I've experimented with this idea that a gene doesn't just say a hand belongs to a class, it also says how much a hand belongs to a class. Currently, the confidence of a gene can range from [-100, 100]. This allows very broad genes to be built, as well as allowing the option for a gene to say that a hand doesn't belong to a class under these conditions (confidence < 0). Generally, when selecting which class a hand belongs to, you select the hand with the highest confidence. If the hand has a low confidence for all classes, you can then select the default class.

Mutation Resistance

This is something I've experimented with that seems to be extremely important in having a timely evolution process. The idea is, the more times a gene makes a correct classification, the higher its mutation resistance. In this case, the higher the mutation resistance, the less likely a gene is to be mutated. This way, genes that are very important in the classification process are more likely to be passed on without any tampering, and genes that perform poorly are more likely to be mutated. This greatly speeds up the process of evolving the population into a good solution, however, there are some caveats.

  • Simple Rules: This could make it more likely to have simple rules, that a lot of the hands have in common, yet they may not actually describe the class.
  • All Genes >100% Mutation Resistance: It can also make the evolution process come to a standstill further along in the training process, as all the genes now have a mutation resistance greater than 100%, so that there are never new genes introduced into the gene pool through mutations.
  • Different Proportions: The classes aren't evenly distributed among the hands. This can make it very difficult to determine a method to assign mutation resistance, as the classes that have a higher representation in the dataset will also be more likely to be assigned correctly by a gene, therefore genes of certain classes may have higher mutation resistances. On the other hand, if the mutation resistance is assigned based on the number of data points in its class, then the mutation resistance for those genes whose classes have a small number of data samples may have an easier time getting high mutation resistances than the others, as they could more easily classify a higher proportion of their class correctly.

While these seem like very troubling issues, the alternative is worse. Oftentimes, I'd leave code running overnight without this feature, and the accuracy would stay very low and barely improve the entire night. With this feature, I can get a better accuracy within 10s of minutes than I could while leaving it overnight. This property seems to be essential to making progress with the Genetic Algorithm, however the tuning and implementation of it is quite difficult. Currently, its based on the number of data points that are in the class, so if a gene sees a data point from its class, its mutation resistance increases by (1 / number_of_data_points_in_class). As mentioned above, this has its problems, and makes it easier to learn the classes with a small number of data points, and harder to learn the classes with a large number of data points.

Chromosome

A chromosome is in essence a list of genes that make up our solution. To classify a hand, the chromosome passes that hand to each and every gene in its list, and picks the class with the highest confidence.

Default Class

I've experimented with a default class feature, where if the confidence of all classes for a hand is less than 100 (or any constant, but this is the one I've chosen for now), then the hand will be classified as belonging to the default class. My hopes for this feature were that if it couldn't find the rules for, say, the "No Hand" class, then it would make that the default class, and classification for that class would improve. But, unfortunately, that's not always the case. It would definitely give a good jab at finding rules for the "No Hand" class, and would often set one of the other classes to the default class. I'm not so sure this feature helped very much, but at least it gives us a class to fall back on if all else fails.

Fitness

I'm currently calculating fitness on a class-by-class performance. Originally, I used a fitness that counted the number of correct classifications. Using this, I was able to get 92% accuracy on the training dataset (Wow!)! However, considering that the first 2 classes make up 92% of the dataset (and there are 10 classes, and the first two are "Nothing" and "One-pair"), I basically made a classifier that could distinguish between nothing and one-pair, which isn't the best. My current fitness function takes the percent accuracy of each class, and adds them up. So, each class can contribute up to 100 points to the fitness, and with 10 classes, that's a maximum fitness of 1000 points (perfect classification of training set). This is good, cuz I can learn the more complicated classes, but bad because it's generally not very good at learning the bigger classes, so my classification accuracy is usually <50%. So yeah, that kinda sucks, but hey, at least I know what a Royal Flush looks like!

Population

This is where it all goes down. This is where the chromosomes are born, this is where they die. This is where legends are made! (I'm bored and tired, ok?)

Anyways... This class is basically a list of chromosomes. In this class, the crossover, mutations, population initialization, and fitness computations take place. This is basically where all the genetic algorithm stuff is. I've also got lots of features in here that I've implemented, and that can be turned on by assigning certain properties to certain values. I don't know if I can remember every single feature I implemented in hopes of improvement, but I'll try! But before that, I'll go over the basic genetic algorithm stuff.

Population Initialization

Population initialization randomly generates a fixed number of chromosomes, equal to the population_size parameter. Each chromosome has a random number of genes (maximum equal to max_genes), and each gene has a random number of conditions (maximum equal to max_conds). All fundamental parameters of the Chromosomes and Genes are initialized randomly.

Crossover

This is where parents are selected to breed together to make children for the new generation. The process is pretty much like this:

  1. Select parents (a parent can only be selected once)
  2. Use the genes of parent to breed new children (genes randomly assigned to children)
  3. Mutate children (multiple kinds of mutations)
  4. Remove any children without genes
  5. Remove any genes without conditions
  6. Trim the population as needed
  7. Add children to population

In step 1), there is the selection process, where the parents can be selected. There are 3 selection methods: rank selection, roulette selection, and random selection. Each selection method is in itself random, however the different methods have different selection probability distributions. The selection method is chosen randomly for each selection. After a parent has been selected, it is now no longer a candidate for a new selection. This is to keep up diversity in a population, and avoid situations where every gene had the same parent, cuz that can really slow down the evolution process, as it would then be solely relying on mutations to progress the population. The selection methods are described as follows:

Random Selection

In this method, each chromosome has an equal probability of being selected as parent. This method is used for selection 11.11% of the time.

Roulette Selection

In this method, the probability of selecting a chromosome for crossover is directly proportional to the chromosome's fitness. This method is used for selection 44.44% of the time.

Rank Selection

In this method, the probability of selecting a chromosome for crossover is directly proportional to the chromosome's rank. The rank is determined by sorting the chromosomes in ascending order by fitness. The rank is its position in the list. The chromosome with the highest fitness will have a rank equal to the length of the list, and the chromosome with the lowest fitness will have a rank equal to 1. This method is used for selection 44.44% of the time.

In step 2), the parent genes are used to make new children. This works by going through each gene of both parents, and randomly assigning the genes to different children.

In step 3), the children are mutated. Look at the mutation section for more info.

Step 4-6 are cleanup steps, where anything that is not useful and would only slow down classification is removed.

In step 7, the children are added to the population. The population is sorted by fitness, and the chromosomes with the lowest fitness are removed to make way for the children. However, when the lifetime parameter of the population is set to a value > 0, only chromosomes older than the lifetime will die (where lifetime is measured in generations).

Mutations

There are many kinds of mutations, and mutations can be applied either to chromosomes, or their genes. There are three kinds of chromosome mutations:

  1. Insertion
  2. Deletion
  3. Default Class Mutation

Chromosome Insertion

Insertion randomly adds genes into the chromosome. It can insert up to a maximum of max_genes, and each of max_genes has a probability of insertion.

Chromosome Deletion

Deletion randomly removes genes from the chromosome. It loops through each gene, and deletes it with a given probability. This probability is reduced if genes have a high mutation resistance. WARNING: If mutation resistance > probability of deletion, the gene will never be removed through deletion.

Chromosome Default Class Mutation

The default/fallback class of a chromosome is randomly assigned a new value, with a given probability.

All gene mutations' probabilities of mutation are reduced by the mutation resistance. There are 5 kinds of gene mutations:

  1. Insertion
  2. Deletion
  3. Recombination
  4. Class Mutation
  5. Confidence Mutation

Gene Insertion

Insertion randomly adds new conditions to a gene. It can add a maximum of max_conds to the gene.

Gene Deletion

Deletion randomly removes conditions from a gene.

Gene Recombination

Recombination randomly changes the parameters of a condition in the gene to random values.

Gene Class Mutation

Class Mutation randomly assigns a new class to the gene.

Gene Confidence Mutation

Confidence Mutation randomly assigns a new confidence to the gene.

Fitness Computation

The fitness computation is completely by using each chromosome in the population to classify all the hands provided for training. The actual fitness calculation itself is done within each chromosome.

One v.s. All

One v.s. All mode is a feature that can be used to try learning the rules for classes one-at-time. What it does, is it forces all genes in the population to be a certain class. There are two ways to make classifications in this mode:

  1. Say that a hand belongs to this class
  2. Say that a hand doesn't belong to this class

Parameters

When the Population Class is initialized, many parameters can be passed to it, at least three parameters MUST be passed to it. The three that must be passed are:

  1. population_size: The number of chromosomes in the initial population
  2. max_genes: The maximum number of genes per chromosome in the initial population
  3. max_conds: The maximum number of conditions per gene in the initial population All other parameters have default values. The following will attempt to adequately describe the function of each parameter.

crossover_rate

This parameter can range between (0,1). It contains the percentage of the population to be used in crossover. If this parameter is 50%, then half the population will be selected as parents to generate the same number of children. Since we can only have an integer number of parents or children, this number is rounded down, i.e., parent/children_count = floor(crossover_rate*population_size). This can be further affected by the number of parents for each crossover event.

mutate_rate

This parameter can range between (0,1). It contains the probability that a mutation can occur. This can be used to control mutations across all varieties, without changing each individual probability.

insert_rate_c

This parameter can range between (0,1). It contains the probability that an individual chromosome insertion might occur.

delete_rate_c

This parameter can range between (0,1). It contains the probability that an individual chromosome deletion might occur.

class_rate_c

This parameter can range between (0,1). It contains the probability that the default class of a chromosome will mutate.

insert_rate_g

This parameter can range between (0,1). It contains the probability that an individual gene insertion might occur.

delete_rate_g

This parameter can range between (0,1). It contains the probability that an individual gene deletion might occur.

recomb_rate_g

This parameter can range between (0,1). It contains the probability that an individual gene recombination might occur.

class_rate_g

This parameter can range between (0,1). It contains the probability that the class of a gene will mutate.

confidence_rate_g

This parameter can range between (0,1). It contains the probability that the confidence of a gene will mutate.

rand_child

This parameter is a whole number (integer >= 0). It can be used to generate random children with no parents at each generation. This can be used to add diversity to a population.

parent_count

This parameter is an integer >= 1. It can be used to control the number of parents required to generate children. E.G., if this parameter is 3, then 3 parents will be selected to generate 3 children, and this will continue until the required number of children has been generated.

lifetime

This parameter is an integer. It controls the maximum age before death for a chromosome. Once chromosomes reach this age, they will be deleted from the population. If this parameter is > 1, then chromosomes will only die from age, and not from poor fitness. By setting this property to -1, chromosomes can live forever, and will die if they don't have a good enough fitness.

pop_limit

This parameter is an integer. It controls the maximum amount of chromosomes allowed in a population. It can be used in conjunction with the lifetime parameter to limit the population size, as the population will most likely grow with each generation.

one_vs_all

This parameter is an integer. It is used to control the one-vs-all mode of the population. The value of the parameter is the class used for the "one". To deactivate this mode, set the parameter to -1.

seed

Can be any number. Used for repeatable random processes.