Feature Creation QPP Reference lists
The feature creation described here refers to the code in query_features.py.
Feature creation occurs when the generate variable is True.
The process is as follows:
- An instance of the class QueryFeatureFactory(corpus, queries_group, quantile) is created for the given corpus, queries group and variations quantile.
- The instance method testing_feat.generate_features() is then used to generate the features DataFrame.
The generate_features() method consists of a call to self._calc_features(), which calculates the similarity scores and returns a DataFrame with the relevant (raw) values, followed by a call to a normalization function.
The normalization function currently in use is self._sum_scores(_df), which divides each similarity value by the sum of that similarity over all the variations of the same topic.
The self._calc_features() method calculates the similarity of the query variations to the topic query (the query that represents the topic), which is stored in the instance attribute self.topics_data.
It iterates over the queries in self.topics_data and, for each of them, iterates over all the relevant variations (query variations from the same topic) stored in self.queries_data.
For each variation it calculates and stores all the similarity measures with respect to the topic query.
The method returns a DataFrame with the following columns:
'topic', 'qid', 'Jac_coefficient', 'Top_10_Docs_overlap', 'RBO_EXT_100', 'RBO_FUSED_EXT_100'
- topic - the topic id, in the regular QID format
- qid - the query id of the UQV variation
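The nested iteration described above can be sketched roughly as follows. This is only an illustration, not the code in query_features.py: the function names, the dictionary layouts, the choice to compute the Jaccard coefficient over whitespace-separated query terms, and the choice to report the top-10 overlap as a raw count are all assumptions. The RBO columns are left out.

```python
def jaccard_coefficient(query_a: str, query_b: str) -> float:
    """Jaccard coefficient between the term sets of two queries (assumed definition)."""
    terms_a, terms_b = set(query_a.split()), set(query_b.split())
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def top_docs_overlap(docs_a: list, docs_b: list, k: int = 10) -> int:
    """Number of documents shared by the top-k entries of two ranked lists."""
    return len(set(docs_a[:k]) & set(docs_b[:k]))


def calc_features(topics: dict, variations: dict, results: dict) -> list:
    """Compare every variation with the topic query of its own topic.

    Hypothetical input layouts:
      topics:     {topic_id: topic query text}
      variations: {topic_id: {qid: variation query text}}
      results:    {topic_id or qid: ranked list of document ids}
    """
    rows = []
    for topic, topic_query in topics.items():
        # Only variations that belong to the same topic are compared.
        for qid, var_query in variations.get(topic, {}).items():
            rows.append({
                'topic': topic,
                'qid': qid,
                'Jac_coefficient': jaccard_coefficient(topic_query, var_query),
                'Top_10_Docs_overlap': top_docs_overlap(results[topic], results[qid]),
            })
    return rows
```

The resulting list of row dictionaries can be passed to pandas.DataFrame(rows) to obtain a table with the columns listed above (minus the RBO ones).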
The _filter_queries(df) method returns a DataFrame containing only the query ids that exist in self.variations_data.queries_df.
The self.variations_data object consists only of the variations (without the topic queries).
The method is therefore used to remove the topic queries from a given DataFrame.
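A minimal sketch of this kind of filtering with pandas is shown below. It assumes that the features DataFrame has a qid column and that the variation ids are available as a plain list; how self.variations_data.queries_df actually stores them is not stated here, so variation_qids is a stand-in.

```python
import pandas as pd


def filter_queries(df: pd.DataFrame, variation_qids) -> pd.DataFrame:
    """Keep only the rows whose qid appears among the variation ids."""
    return df.loc[df['qid'].isin(variation_qids)]


features = pd.DataFrame({
    'topic': ['301', '301', '301'],
    'qid': ['301', '301-1', '301-2'],  # the first row is the topic query itself
    'Jac_coefficient': [1.0, 0.5, 0.25],
})
variation_qids = ['301-1', '301-2']  # hypothetical stand-in for the ids in queries_df
filtered = filter_queries(features, variation_qids)
```

In this example the row for the topic query 301 is dropped and only the two variations remain.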
The _sum_scores(df) method calculates the sum of every column for each topic, and divides the values in each column by the sum for the corresponding topic.
The method returns a sum-normalized DataFrame. If the _filter_queries(df) line is uncommented, the normalization occurs after the topic queries are removed, which results in a valid probability distribution over the query variations of the topic.
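The per-topic sum normalization can be illustrated with a short pandas sketch. It is written under assumptions rather than copied from query_features.py: the function name sum_scores, the example values, and the rule that every column other than topic and qid is a feature column are all hypothetical.

```python
import pandas as pd


def sum_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Divide each feature value by the sum of its column within the same topic."""
    feature_cols = [c for c in df.columns if c not in ('topic', 'qid')]
    normalized = df.copy()
    # transform('sum') returns the per-topic sum aligned with the original rows.
    topic_sums = df.groupby('topic')[feature_cols].transform('sum')
    normalized[feature_cols] = df[feature_cols] / topic_sums
    return normalized


features = pd.DataFrame({
    'topic': ['301', '301', '302', '302'],
    'qid': ['301-1', '301-2', '302-1', '302-2'],
    'Jac_coefficient': [0.5, 0.25, 0.2, 0.2],
    'Top_10_Docs_overlap': [6, 2, 1, 3],
})
normalized = sum_scores(features)
```

After normalization, the values of each feature column sum to 1 within every topic, provided the topic queries were removed beforehand and the column sum for the topic is non-zero (a zero sum would yield NaN in this sketch).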