
Feature Creation QPP Reference lists

Oleg Zendel edited this page Dec 31, 2018 · 5 revisions

The feature creation described here refers to the code in the file query_features.py

Feature creation runs when the generate variable is True. The process is as follows:

  1. An instance of the class QueryFeatureFactory(corpus, queries_group, quantile) is created for the given corpus, queries group and variations quantile.
  2. The instance's generate_features() method is then used to generate the features DataFrame: testing_feat.generate_features()

generate_features()

The method first calls self._calc_features(), which calculates the similarity scores and returns a DataFrame with the raw values, and then calls a normalization function. The normalization function currently in use is self._sum_scores(_df), which divides each similarity value by the sum of that similarity over all the variations of the same topic.

_calc_features(self)

The method calculates the similarity of each query variation to the topic query (the query that represents the topic). The topic queries are stored in the instance attribute self.topics_data. The method iterates over the queries in self.topics_data and, for each one, iterates over all the relevant variations (query variations from the same topic) stored in self.queries_data. For each variation it calculates and stores all the similarity measures to the topic query. The method returns a DataFrame with the following columns: 'topic', 'qid', 'Jac_coefficient', 'Top_10_Docs_overlap', 'RBO_EXT_100', 'RBO_FUSED_EXT_100'

  • topic - has a regular QID format and it represents the topic id
  • qid - the query id of the UQV variation
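As a rough illustration of two of the measures named in the columns above, the sketch below computes a Jaccard coefficient over query terms and the overlap between two top-10 document lists. The function names, the whitespace tokenization and the use of a raw overlap count are assumptions made for illustration; the actual computation in query_features.py may differ.

```python
def jaccard_coefficient(topic_query, variation):
    """Jaccard coefficient between the term sets of two queries.

    Assumption: terms are obtained by lower-casing and splitting on whitespace.
    """
    a = set(topic_query.lower().split())
    b = set(variation.lower().split())
    if not a and not b:
        return 0.0  # two empty queries: define the similarity as 0
    return len(a & b) / len(a | b)


def top_k_overlap(topic_docs, variation_docs, k=10):
    """Number of documents shared by the top-k entries of two ranked lists."""
    return len(set(topic_docs[:k]) & set(variation_docs[:k]))
```

Each function takes the topic query (or its result list) first and the variation second, mirroring the topic-to-variation comparison described above.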

_filter_queries(self, df)

The method returns a DataFrame containing only the query ids that exist in self.variations_data.queries_df. The self.variations_data object consists only of the variations (without the topic queries), so this method is used to remove the topic queries from a given DataFrame.
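The filtering step can be sketched with pandas as follows. This is a minimal standalone version, assuming the query ids sit in a 'qid' column and that the variation ids are available as a plain set; filter_queries and the sample ids are hypothetical, not the class's actual code.

```python
import pandas as pd


def filter_queries(df, variation_qids):
    """Keep only the rows whose qid belongs to the variations (hypothetical helper)."""
    return df.loc[df['qid'].isin(variation_qids)]


features = pd.DataFrame({
    'topic': ['301', '301', '301'],
    'qid': ['301', '301-1', '301-2'],  # made-up ids; '301' stands for the topic query
    'Jac_coefficient': [1.0, 0.5, 0.25],
})

# Only the two variation rows remain; the topic query row is dropped.
filtered = filter_queries(features, {'301-1', '301-2'})
```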

_sum_scores(self, df)

The method calculates the sum of every column for each topic and divides the values in each column by the sum for the corresponding topic. It returns a sum-normalized DataFrame. If the _filter_queries(df) line is un-commented, normalization takes place after the topic queries are removed, which yields a valid probability distribution over the query variations of each topic.
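A minimal sketch of per-topic sum normalization, assuming a 'topic' column and an explicit list of score columns; sum_scores and the sample values are illustrative and not taken from query_features.py.

```python
import pandas as pd


def sum_scores(df, score_columns):
    """Divide each score by the sum of its column within the same topic."""
    normalized = df.copy()
    # transform('sum') returns the per-topic sums aligned with the original rows
    topic_sums = df.groupby('topic')[score_columns].transform('sum')
    normalized[score_columns] = df[score_columns] / topic_sums
    return normalized


features = pd.DataFrame({
    'topic': ['301', '301', '302', '302'],
    'qid': ['301-1', '301-2', '302-1', '302-2'],  # made-up variation ids
    'Jac_coefficient': [0.5, 0.25, 0.2, 0.6],
})

normalized = sum_scores(features, ['Jac_coefficient'])
```

After normalization the values of each score column sum to 1 within every topic. A topic whose column sums to zero would produce NaN values in this sketch, so such cases would need separate handling.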
