diff --git a/DEPLOY.md b/DEPLOY.md
new file mode 100644
index 0000000..6caa80f
--- /dev/null
+++ b/DEPLOY.md
@@ -0,0 +1,59 @@
+## How to deploy voxelgpt to FiftyOne Teams
+
+> **You must have the [contributor steps](README.md#contributing) completed before running the commands below**
+
+**Bump the version**
+
+Bump the version of the plugin by running:
+
+```
+# bumps the patch version (for bug fixes only)
+yarn bump
+
+# sets the version
+yarn bump -- 1.2.3
+```
+
+**Commit all Files**
+
+Only files committed locally will be included in the plugin archive.
+
+This is also a good time to tag the new version.
+
+```
+VERSION=1.2.3
+git checkout -b release/$VERSION
+git add . # files you want included
+git commit -m "release version $VERSION" # this will be in the output from the command above
+git tag -a $VERSION -m "release version $VERSION" # annotated tag, so --follow-tags pushes it
+git push origin release/$VERSION --follow-tags # push the commit and tags
+```
+
+**Create the Plugin Archive**
+
+```
+yarn archive
+```
+
+**Upload Archive to Teams**
+
+Go to [https://MY_FIFTYONE_TEAMS/settings/plugins](https://MY_FIFTYONE_TEAMS/settings/plugins).
+
+To install a new plugin, click "Install plugin". To upgrade an existing plugin, find it in the list, click the three dots, and choose "Upgrade plugin".
+
+Upload the newly created archive.
+
+**Set your Permissions**
+
+Find the plugin in the list and click on "X operators". Select the appropriate permissions for your plugin.
+
+**That's it!**
+
+At this point you should have a newly installed/upgraded plugin. Users will see this change immediately.
+
+**Troubleshooting Tips**
+
+If you are seeing issues with a plugin not updating:
+
+ - check the logs for any additional information
+ - restart the appropriate pods (if you have `teams-plugins` pods, restart only those; otherwise restart the `fiftyone-app` pods)
\ No newline at end of file diff --git a/README.md b/README.md index ac1107e..619551c 100644 --- a/README.md +++ b/README.md @@ -358,6 +358,27 @@ You can manually lint a file if necessary like so: pre-commit run --files ``` +**Developing and Building the Plugin JS Bundle** + +To build the FiftyOne plugin, you must: + + - install `fiftyone` from source, including the app dependencies. [See here](https://github.com/voxel51/fiftyone/blob/develop/CONTRIBUTING.md) for details. + - set the environment variable `FIFTYONE_DIR` to the source directory of `fiftyone` + - install `yarn@3.5.x` + - install the `voxelgpt` dependencies by running `yarn install` in the `voxelgpt` directory + +To create a build, run: + +```sh +# for a production build of the plugin js bundle +yarn build + +# for rebuilding the bundle automatically during development +yarn dev +``` + +> NOTE: when developing locally, you must set `FIFTYONE_PLUGINS_DIR` to a directory containing the `voxelgpt` directory. + ## How does it work? VoxelGPT uses: @@ -394,6 +415,10 @@ The current implementation supports most FiftyOne certain stages like `concat()`, `mongo()`, and `geo_within()` are not yet supported. We're working on it! +### Deploying on FiftyOne Teams + +Instructions for deploying the plugin to FiftyOne Teams are [here](DEPLOY.md). 
+ ## About FiftyOne If you've made it this far, we'd greatly appreciate if you'd take a moment to diff --git a/examples/viewstage_embeddings.pkl b/examples/viewstage_embeddings.pkl index 2ce05a6..c813912 100644 Binary files a/examples/viewstage_embeddings.pkl and b/examples/viewstage_embeddings.pkl differ diff --git a/examples/viewstage_examples.csv b/examples/viewstage_examples.csv index 545a9e6..5ac92c1 100644 --- a/examples/viewstage_examples.csv +++ b/examples/viewstage_examples.csv @@ -29,7 +29,7 @@ first 100 samples,[limit(100)],Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all Limit the view to 35,[limit(35)],Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all Only include samples that contain predictions with > 99% confidence,"[match_labels(filter=F('confidence') > 0.99, fields='predictions')]",Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all Only include samples that contain labels with ids hofwihuf or abxjhbvcie,"[match_labels(ids=[hofwihuf, abxjhbvcie])]",Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all -get samples with labels with the 'test tag,"[match_labels(tags='test')] +get samples with labels with the 'test tag,"[match_labels(tags='test')] ",Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all Only include samples that have the 'mistake' tag,[match_tags('mistake')],Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all Only include samples that do not have the 'validation' tag,"[match_tags('validation', bool=False)]",Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all @@ -107,7 +107,7 @@ display the whole dataset,[],Jacob,all,FALSE,FALSE,FALSE,FALSE,FALSE,all Exclude frame with id 'kbdskajdvfef',[exclude_frames(['kbdskajdvfef'])],Jacob,video,FALSE,FALSE,FALSE,FALSE,FALSE,all clip view with one clip per meeting,"[filter_labels('events', F('label') == 'meeting'), to_clips('events')]",Jacob,video,FALSE,FALSE,FALSE,FALSE,FALSE,all Create a clips view that contains one clip for each contiguous segment that contains at least one road sign in every frame,"[filter_labels('frames.detections', F('label') == 
'road sign'), to_clips('frames.detections')]",Jacob,video,FALSE,FALSE,FALSE,FALSE,FALSE,detection -Create a trajectories view for the vehicles in the dataset,"[filter_labels('frames.detections', F('label') == 'vehicle'), +Create a trajectories view for the vehicles in the dataset,"[filter_labels('frames.detections', F('label') == 'vehicle'), to_trajectories('frames.detections')]",Jacob,video,FALSE,FALSE,FALSE,FALSE,FALSE,detection show 'vehicle' detections in the 'detections' field,"[filter_labels('frames.detections', F('label') == 'vehicle')]",Jacob,video,FALSE,FALSE,FALSE,FALSE,FALSE,detection "Create a frames view that only contains frames with at least 10 objects, sampled at a maximum frame rate of 1fps","[match_frames(F('detections.detections').length() > 10), to_frames(max_fps=1)]",Jacob,video,FALSE,FALSE,FALSE,FALSE,FALSE,detection @@ -150,8 +150,8 @@ Discard all predictions with confidence below 0.3,"[filter_labels('predictions', Only include classifications in the `predictions` field whose label is 'frog' or 'turtle',"[filter_labels('predictions', F('label').is_in(['frog', 'turtle']))]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,all Only include polylines in the `faster-rcnn` field whose `label` is 'lane',"[filter_labels('faster-rcnn', F('label') == 'lane')]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,all Only contains predictions whose bounding boxes' upper left corner is a Manhattan distance of at least 1 from the origin,"[filter_labels('predictions, F('bounding_box')[0] + F('bounding_box')[1] > 1)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection -"Create a view that only contains predictions whose bounding boxes -have area < 0.2 with confidence > 0.9, and only include samples with +"Create a view that only contains predictions whose bounding boxes +have area < 0.2 with confidence > 0.9, and only include samples with at least 15 such objects","[filter_labels('predictions', (bbox_area < 0.2) & (F('confidence') > 0.9)), 
match(F('predictions.detections').length() > 15)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection Only include detections in the `predictions` field whose bounding box is smaller than 0.2,"[filter_labels('predictions', F('bounding_box')[2] * F('bounding_box')[3] < 0.2)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection Only include polylines in the `predictions` field that are filled,"[filter_labels('predictions', F('filled') == True)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,polyline @@ -168,7 +168,7 @@ the first 30 samples with a plant,"[match(F('ground_truth.detections.label').con 10 random images with tables,"[match(F('ground_truth.detections.label').contains('table')), take(10)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection Contains a rabbit and a tortoise prediction,"[match(F('predictions.detections.label').contains(['rabbit', 'tortoise'], all=True))]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection Contains a cat or mouse but not both,"[match(F('predictions.detections.label').contains(['cat', 'mouse']) & ~F('predictions.detections.label').contains(['cat', 'mouse'], all=True))]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection -Only contains samples whose first and last prediction have the same label,"[match(F('predictions.detections')[0].apply(F('label')) == F('predictions.detections').reverse()[0].apply(F('label')))] +Only contains samples whose first and last prediction have the same label,"[match(F('predictions.detections')[0].apply(F('label')) == F('predictions.detections').reverse()[0].apply(F('label')))] ",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection unique and wrong,"[match(F('predictions.label') != F('ground_truth.label')), sort_by('uniqueness', reverse=True)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,classification fewer than 4 ground truth detections,[match(F('ground_truth.detections').length() < 10)],Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection @@ -359,4 +359,16 @@ give me all the samples without a motorcycle 
prediction,"[match(!F(""pred.detect cars or pedestrians,"[filter_labels(""gt"", F(""label"").is_in([""car"", pedestrian""])]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,all show me all the samples that were misclassified,"[match(F(""EVAL_KEY"") == False)]",Jacob,image,FALSE,FALSE,FALSE,TRUE,FALSE,classification I want any images that were correctly classified,"[match(F(""EVAL_KEY"") == True)]",Jacob,image,FALSE,FALSE,FALSE,TRUE,FALSE,classification -show me the first 10 incorrectly classified predictions,"[match(F(""EVAL_KEY"") == False), limit(10)]",Jacob,image,FALSE,FALSE,FALSE,TRUE,FALSE,classification \ No newline at end of file +show me the first 10 incorrectly classified predictions,"[match(F(""EVAL_KEY"") == False), limit(10)]",Jacob,image,FALSE,FALSE,FALSE,TRUE,FALSE,classification +show me object patches for airplanes in the predictions field,"[filter_labels(""predictions"", F(""label"") == ""airplane""), to_patches(""predictions"")]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +all of the model1 detection patches in the first image,"[limit(1), to_patches(""model1"")]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +I only want to see the high confidence object detections predicted by resnet,"[limit(10), filter_labels(""resnet"", F(""confidence"")>0.9), to_patches(""resnet"")]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +display all the carrot objects in the wandb_05_09 field,"[filter_labels(""wandb_05_09"", F(""label"")==""carrot""), to_patches(""wandb_05_09"")]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +show me all the non-dog detections,"[filter_labels(""ground_truth"", F(""label"") != ""dog"")]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +only show samples that don't have a hat prediction,"[match(~F(""prediction.detections.label"").contains(""hat""))]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +animals that aren't horses,"[filter_labels(""ground_truth"", F(""label"") != ""horses""), 
sort_by_similarity(""animal"", brain_key = ""TEXT_SIM_KEY"", k = 25)]",Jacob,image,FALSE,TRUE,FALSE,FALSE,FALSE,all +samples classified as anything but a snail,"[match(~(F(""cls.label"") == ""snail""))]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,classification +samples with 2 or more Cats,"[match(F(""gt.detections"").filter(F(""label"") == ""Cat"").length() >= 2)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +all images with at least 6 plates,"[match(F(""detections.detections"").filter(F(""label"") == ""plate"").length() >= 6)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +images with no more than four lamps or lanterns,"[match(F(""ground_truth.detections"").filter(F(""label"").is_in([""lamp"", ""lantern""])).length() <= 4)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection +samples with exactly one prediction of a bed,"[match(F(""model.detections"").filter(F(""label"") == ""bed"").length() == 1)]",Jacob,image,FALSE,FALSE,FALSE,FALSE,FALSE,detection \ No newline at end of file diff --git a/fiftyone_api_embeddings.pkl b/fiftyone_api_embeddings.pkl new file mode 100644 index 0000000..295e85b Binary files /dev/null and b/fiftyone_api_embeddings.pkl differ diff --git a/fiftyone_docs_embeddings.pkl b/fiftyone_docs_embeddings.pkl index 3d01bef..d3f25ed 100644 Binary files a/fiftyone_docs_embeddings.pkl and b/fiftyone_docs_embeddings.pkl differ diff --git a/links/dataset_view_generator.py b/links/dataset_view_generator.py index 9d12054..1488e75 100644 --- a/links/dataset_view_generator.py +++ b/links/dataset_view_generator.py @@ -94,6 +94,10 @@ template=TEXT_SIMILARITY_PROMPT_TEMPLATE, ) +SIMILARITY_QUERY_PROMPT_PATH = os.path.join( + PROMPTS_DIR, "similarity_query_extractor_prompt.txt" +) + DETECTION_KEYWORDS = ( "_fp", "_fn", @@ -105,6 +109,17 @@ CLASSIFICATION_KEYWORDS = ("False", "True") +TEXT_SIM_KEYWORDS = ( + "show", + "display", + "find me", + "images", + "pictures", + "photos", + "videos", + "samples", +) + def 
generate_evaluation_prompt(sample_collection, eval_key): schema = sample_collection.get_field_schema() @@ -671,14 +686,21 @@ def _get_confidence_subfield(field): def _get_ground_truth_field(): return label_fields[0] + def _check_non_none_field(field, sample_collection): + + # return True if field (fully-qualified) exists and contains non-None value + v = sample_collection.limit(1).values(field)[0] + if isinstance(v, list): + v = v[0] + return v is not None + def _get_predictions_field(): if len(label_fields) == 1: return label_fields[0] for field in label_fields: conf_field = _get_confidence_subfield(field) - - if sample_collection.first()[conf_field]: + if _check_non_none_field(conf_field, sample_collection): return field return label_fields[0] @@ -743,6 +765,8 @@ def get_label_field(contents, used_classes): classes_str = f'F("label").is_in({class_strs})' contents = f"filter = {classes_str}{field_names_str}" + if ".label" in contents: + contents = contents.replace(".label", "") return f"match_labels({contents})" elif "is_in" in contents: is_in = contents.split("is_in([")[1].split("])")[0] @@ -752,8 +776,12 @@ def get_label_field(contents, used_classes): class_strs = [f"{class_name}" for class_name in elems] classes_str = f'F("label").is_in({class_strs})' contents = f"filter = {classes_str}{field_names_str}" + if ".label" in contents: + contents = contents.replace(".label", "") return f"match_labels({contents})" elif "filter=" in contents: + if ".label" in contents: + contents = contents.replace(".label", "") for field in label_classes.keys(): if f'F("{field}")' in contents: contents = contents.replace(f'F("{field}")', f'F("label")') @@ -801,9 +829,27 @@ def _validate_filter_labels(stage, label_classes): if label_field not in label_classes.keys(): for field in label_classes.keys(): if field in label_field and field != label_field: - contents = contents.replace(label_field, field) + contents = contents.replace(args[0], f'"{field}"') break + ##### correct 
three-argument hallucination of form 'filter_labels("label_field", "==", "class_name")' + eq_pattern = r'"([^"]+)",\s*"==",\s*"([^"]+)"' + eq_matches = re.findall(eq_pattern, contents) + if eq_matches: + match = eq_matches[0] + label_field = match[0] + class_name = match[1] + return f'filter_labels("{label_field}", F("label") == "{class_name}")' + + ##### correct three-argument hallucination of form 'filter_labels("label_field", "!=", "class_name")' + neq_pattern = r'"([^"]+)",\s*"!=",\s*"([^"]+)"' + neq_matches = re.findall(neq_pattern, contents) + if neq_matches: + match = neq_matches[0] + label_field = match[0] + class_name = match[1] + return f'filter_labels("{label_field}", F("label") != "{class_name}")' + ##### correct second argument if needed if len(args) == 2: arg1 = args[1].strip() @@ -900,6 +946,65 @@ def _validate_negation_operator(stage): return stage +def _get_false_patterns(stage): + false_patterns = [ + r",\s*False", + r",\s*invert\s*=\s*True", + ] + + if stage.startswith("match_labels"): + return false_patterns + elif stage.startswith("match_tags"): + return false_patterns + elif stage.startswith("exists"): + return false_patterns + else: + return false_patterns + [r",\s*bool\s*=\s*False"] + + +def _validate_bool_condition(stage): + false_patterns = _get_false_patterns(stage) + + for pattern in false_patterns: + false_matches = re.findall(pattern, stage) + if false_matches: + stage = re.sub(pattern, "", stage) + opening_paren_index = stage.index("(") + # Extract the function name + stage_name = stage[:opening_paren_index] + + # Extract the contents + contents = stage[opening_paren_index + 1 : -1] + return f"{stage_name}(~({contents}))" + return stage + + +def load_similarity_query_prompt(): + cache = get_cache() + key = "similarity_query_prefix" + if key not in cache: + with open(SIMILARITY_QUERY_PROMPT_PATH, "r") as f: + cache[key] = f.read() + return cache[key] + + +def extract_similarity_query(stage): + pattern = r'sort_by_similarity\("([^"]+)"' 
+ query = re.search(pattern, stage).group(1) + sim_query_prompt = load_similarity_query_prompt().replace("QUERY", query) + new_query = get_llm().call_as_llm(sim_query_prompt).strip() + return stage.replace(query, new_query) + + +def _validate_text_similarity(stage): + if "sort_by_similarity" not in stage: + return stage + if any(keyword in stage for keyword in TEXT_SIM_KEYWORDS): + return extract_similarity_query(stage) + else: + return stage + + def _postprocess_stages( stages, sample_collection, @@ -926,8 +1031,10 @@ def _postprocess_stages( _stage = _validate_filter_labels(_stage, label_classes) if "match_labels" in _stage: _stage = _validate_match_labels(_stage, label_classes) - _stage = _validate_negation_operator(_stage) + _stage = _validate_negation_operator(_stage) + _stage = _validate_bool_condition(_stage) + _stage = _validate_text_similarity(_stage) new_stages.append(_stage) return new_stages diff --git a/links/docs_query_dispatcher.py b/links/docs_query_dispatcher.py index 603e4ae..47807e8 100644 --- a/links/docs_query_dispatcher.py +++ b/links/docs_query_dispatcher.py @@ -5,34 +5,38 @@ | `voxel51.com `_ | """ +from glob import glob +import numpy as np import os import pickle import uuid -from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain -from langchain.document_loaders import ( - DirectoryLoader, - UnstructuredMarkdownLoader, -) from langchain.schema import Document, BaseRetriever -from langchain.text_splitter import TokenTextSplitter import numpy as np from scipy.spatial.distance import cosine # pylint: disable=relative-beyond-top-level from .utils import ( + count_tokens, get_cache, get_embedding_function, - get_llm, stream_retriever, query_retriever, ) +from .markdown_utils import ( + get_markdown_documents, +) + ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) PROMPTS_DIR = os.path.join(ROOT_DIR, "prompts") DOCS_EMBEDDINGS_FILE = os.path.join(ROOT_DIR, "fiftyone_docs_embeddings.pkl") 
+API_DOCS_EMBEDDINGS_FILE = os.path.join( + ROOT_DIR, "fiftyone_api_embeddings.pkl" +) + PROMPT_TEMPLATE_FILE = os.path.join(PROMPTS_DIR, "docs_qa_template.txt") DOC_TYPES = ( @@ -56,6 +60,7 @@ "core.aggregations", "core.annotation", "core.brain", + "core.collections", "core.evaluation", "core.expressions", "core.frame", @@ -73,6 +78,15 @@ "release-notes.html", ) +BAD_PATTERNS = ( + "ts-api", + "detection_mistakenness", + "model_inference", + "label_mistakes", + "dataset_creation/zoo", + "dataset_creation/common_datasets", +) + def _make_api_doc_path(name, docs_dir): return os.path.join(docs_dir, "api", f"fiftyone.{name}.html") @@ -85,76 +99,108 @@ def _get_docs_build_dir(): return os.path.join(fo_repo_dir, "docs", "build", "html") -def _get_url_from_path(path): +def _get_url(path, anchor): rel_path = "".join(path.split("html")[-2:]) - return f"https://docs.voxel51.com{rel_path}html" - + page_url = f"https://docs.voxel51.com{rel_path}html" + if anchor: + anchor = ".".join(anchor.split(".")[:4]) + return page_url + "#" + anchor + else: + return page_url -def _generate_docs_embeddings(): - """Generates embeddings for the FiftyOne documentation. - This is a developer method that only needs to be run once after each - release. It requires a source install of FiftyOne with the fresh docs - build. 
- """ - - all_embeddings_dict = {} - - def add_loader_embeddings(loader, all_embeddings_dict, chunk_size=200): - documents = loader.load() - text_splitter = TokenTextSplitter( - chunk_size=chunk_size, chunk_overlap=0 +def _generate_file_embeddings(filepath): + model = get_embedding_function() + md_docs_dict = get_markdown_documents(filepath) + + ids = [] + contents = [] + sources = [] + for anchor, section in md_docs_dict.items(): + source = _get_url(filepath, anchor) + for chunk in section: + ids.append(str(uuid.uuid1())) # one id per chunk, so ids stay aligned with contents and sources + contents.append(chunk.page_content) + sources.append(source) + + embeddings = model(contents) + curr_embeddings_dict = { + id: {"content": content, "embedding": embedding, "source": source} + for id, content, embedding, source in zip( + ids, contents, embeddings, sources ) - texts = text_splitter.split_documents(documents) + } - ids = [str(uuid.uuid1()) for _ in texts] - contents = [text.page_content for text in texts] - sources = [ - _get_url_from_path(text.metadata["source"]) for text in texts - ] - embeddings = model(contents) + return curr_embeddings_dict - curr_embeddings_dict = { - id: {"content": content, "embedding": embedding, "source": source} - for id, content, embedding, source in zip( - ids, contents, embeddings, sources - ) - } - - all_embeddings_dict = {**all_embeddings_dict, **curr_embeddings_dict} - return all_embeddings_dict +def _get_docs_path_list(): docs_dir = _get_docs_build_dir() - model = get_embedding_function() - # STANDALONE DOCS - for filename in STANDALONE_DOCS: - filepath = os.path.join(docs_dir, filename) - loader = UnstructuredMarkdownLoader(filepath) - all_embeddings_dict = add_loader_embeddings( - loader, all_embeddings_dict - ) + doc_paths = [] + + ### add standalone docs + standalone_paths = [ + os.path.join(docs_dir, filename) for filename in STANDALONE_DOCS + ] + doc_paths.extend(standalone_paths) - # DOCS CATEGORIES + ### add remaining types of docs for doc_type in DOC_TYPES: - print(f"Generating 
embeddings for {doc_type}...") doc_type_dir = os.path.join(docs_dir, doc_type) - loader = DirectoryLoader(doc_type_dir, glob="**/*.html") - all_embeddings_dict = add_loader_embeddings( - loader, all_embeddings_dict + doc_paths.extend( + glob(os.path.join(doc_type_dir, "*.html"), recursive=True) ) - - # API DOCS - for api_doc_path in API_DOC_PATHS: - doc = _make_api_doc_path(api_doc_path, docs_dir) - loader = UnstructuredMarkdownLoader(doc) - all_embeddings_dict = add_loader_embeddings( - loader, all_embeddings_dict + doc_paths.extend( + glob(os.path.join(doc_type_dir, "*/*.html"), recursive=True) ) + good_doc_paths = [] + for doc_path in doc_paths: + if any([bad_patt in doc_path for bad_patt in BAD_PATTERNS]): + continue + else: + good_doc_paths.append(doc_path) + + ### add api docs + api_paths = [ + _make_api_doc_path(api_doc_path, docs_dir) + for api_doc_path in API_DOC_PATHS + ] + # doc_paths.extend(api_paths) + + return api_paths, good_doc_paths + + +def _generate_docs_embeddings(): + """Generates embeddings for the FiftyOne documentation. + + This is a developer method that only needs to be run once after each + release. It requires a source install of FiftyOne with the fresh docs + build. 
+ """ + + all_embeddings_dict = {} + + api_paths, doc_paths = _get_docs_path_list() + + for doc_path in doc_paths: + print(doc_path) + curr_embeddings_dict = _generate_file_embeddings(doc_path) + all_embeddings_dict.update(curr_embeddings_dict) + with open(DOCS_EMBEDDINGS_FILE, "wb") as f: pickle.dump(all_embeddings_dict, f) + api_embeddings_dict = {} + for api_path in api_paths: + print(api_path) + curr_embeddings_dict = _generate_file_embeddings(api_path) + api_embeddings_dict.update(curr_embeddings_dict) + + with open(API_DOCS_EMBEDDINGS_FILE, "wb") as f: + pickle.dump(api_embeddings_dict, f) + class FiftyOneDocsRetriever(BaseRetriever): def __init__(self, embeddings): @@ -177,7 +223,15 @@ def get_relevant_documents(self, query): ) sorted_ix = np.argsort(dists).astype(int) - return [self.contents[ix] for ix in sorted_ix[:10]] + relevant_docs = [self.contents[ix] for ix in sorted_ix[:10]] + lens = [count_tokens(doc.page_content) for doc in relevant_docs] + cumsums = np.cumsum(lens) + cutoff = 3200 + if cumsums[-1] > cutoff: + relevant_docs = [ + doc for doc, cs in zip(relevant_docs, cumsums) if cs < cutoff + ] + return relevant_docs async def aget_relevant_documents(self, query: str): raise NotImplementedError @@ -200,6 +254,9 @@ def _create_docs_retriever(): with open(DOCS_EMBEDDINGS_FILE, "rb") as f: embeddings = list(pickle.load(f).values()) + with open(API_DOCS_EMBEDDINGS_FILE, "rb") as f: + embeddings.extend(list(pickle.load(f).values())) + return FiftyOneDocsRetriever(embeddings) diff --git a/links/markdown_utils.py b/links/markdown_utils.py new file mode 100644 index 0000000..ddf79c6 --- /dev/null +++ b/links/markdown_utils.py @@ -0,0 +1,313 @@ +from markdownify import markdownify +import re +from langchain.schema import Document +from langchain.text_splitter import MarkdownTextSplitter + +SPLITTER = MarkdownTextSplitter(chunk_size=1000) + +ATTR_API_DOCS = ( + "session", + "label", + "collection", + "sample", + "spaces", + "stages", + "view", +) + 
+METHOD_API_DOCS = ( + "expression", + "session", + "frame", + "sample", + "video", + "view", + "collection", +) + + +def remove_footer(page_md): + return page_md.split("[Next ![]")[0] + + +def remove_header(page_md): + md_lines = page_md.split("\n") + + body_lines = [] + in_body = False + for mdl in md_lines: + if len(mdl) > 0 and mdl[0] == "#": + in_body = True + if in_body: + body_lines.append(mdl) + page_md = "\n".join(body_lines) + return page_md + + +def remove_extra_newlines(page_md): + lines = page_md.split("\n") + lines = [line for line in lines if line.strip() != "!"] + page_md = "\n".join(lines) + page_md = re.sub(r"\n{3,}", "\n\n", page_md) + return page_md + + +def remove_empty_code_blocks(page_md): + text_and_code = page_md.split("```") + text_blocks = text_and_code[::2] + code_blocks = text_and_code[1::2] + code_blocks = [ + cb + for cb in code_blocks + if len(cb.strip()) > 0 and not set(cb).issubset(set("| -\n")) + ] + + page_md = "" + for tb, cb in zip(text_blocks, code_blocks): + page_md += tb + "```" + cb + "```" + page_md += text_and_code[-1] + page_md = re.sub(r"```py\s*```", "", page_md, flags=re.MULTILINE) + return page_md + + +def remove_jupyter_widgets(page_md): + lines = page_md.split("\n") + lines = [ + line + for line in lines + if len(line) == 0 or (line[0] != "{" and "jupyter-widgets" not in line) + ] + return "\n".join(lines) + + +def remove_xml(page_md): + lines = page_md.split("\n") + lines = [line for line in lines if not line.startswith(" 0 and not set(cb).issubset(set("| -")) + ] + + page_md = "" + for tb, cb in zip(text_blocks, code_blocks): + page_md += tb + "```py" + cb + "```" + page_md += text_and_code[-1] + return page_md + + +def merge_adjacent_code_blocks(page_md): + pattern = r"```\n```py" + page_md = re.sub(pattern, "", page_md) + page_md = re.sub(r"```py\n```py", r"```py", page_md) + return page_md + + +def remove_bad_elements(page_md): + pattern = r"\(function\(\) {[\s\S]*?}\)\(\);" + page_md = re.sub(pattern, "", 
page_md, flags=re.MULTILINE) + + lines = page_md.split("\n") + lines = [line for line in lines if not line.startswith("@import")] + + bad_keywords = [ + "#focontainer", + "#fooverlay", + "#foactivate", + ] + + good_lines = [] + flag = True + for line in lines: + if any([keyword in line for keyword in bad_keywords]): + flag = False + if flag: + good_lines.append(line) + if "}" in line and not flag: + flag = True + + page_md = "\n".join(good_lines) + return page_md + + +def reformat_markdown(page_md): + page_md = page_md.replace("\_", "_").replace("\*", "*") + page_md = remove_links(page_md) + page_md = remove_images(page_md) + page_md = remove_jupyter_widgets(page_md) + page_md = remove_xml(page_md) + page_md = remove_extra_newlines(page_md) + page_md = remove_bad_elements(page_md) + page_md = remove_code_cell_vestiges(page_md) + return page_md + + +def parse_page_markdown(page_md): + page_md = remove_header(page_md) + page_md = remove_footer(page_md) + page_md = remove_line_numbers(page_md) + page_md = remove_table_rows(page_md) + page_md = remove_empty_code_blocks(page_md) + page_md = add_syntax_highlight_to_code_blocks(page_md) + page_md = merge_adjacent_code_blocks(page_md) + + ### reformat now that the markdown is clean + page_md = reformat_markdown(page_md) + page_md = remove_empty_code_blocks(page_md) + page_md = remove_extra_newlines(page_md) + return page_md + + +def _is_attr_type(filepath): + return any([ad in filepath for ad in ATTR_API_DOCS]) + + +def _is_method_type(filepath): + return any([md in filepath for md in METHOD_API_DOCS]) + + +def preprocess_api_markdown(page_md, filepath): + flag = True + md_lines = page_md.split("\n") + new_lines = [] + + METHOD_FLAG = _is_method_type(filepath) + ATTR_FLAG = _is_attr_type(filepath) + + for line in md_lines: + if line.startswith("**Methods:**"): + if METHOD_FLAG: + flag = True + new_lines.append(line) + else: + flag = False + elif line.startswith("**Attributes:**"): + if ATTR_FLAG: + flag = True + 
new_lines.append(line) + else: + flag = False + elif line.startswith("*property*"): + flag = True + new_lines.append(line.split("[")[0]) + elif line.startswith("*class*"): + flag = True + new_lines.append(line) + elif line.startswith("*classmethod*"): + flag = False + elif "Permalink to this definition" in line: + new_lines.append(line.split("[")[0]) + elif "my_metaclass" in line: + continue + elif flag: + new_lines.append(line) + + return "\n".join(new_lines) + + +def get_page_markdown(filepath): + with open(filepath) as f: + page_html = f.read() + page_md = markdownify(page_html, heading_style="ATX") + page_md = parse_page_markdown(page_md) + if "api" in filepath: + page_md = preprocess_api_markdown(page_md, filepath) + return page_md + + +def split_at_anchors(page_md): + md_lines = page_md.split("\n") + md_sections = {} + curr_anchor = None + curr_section = [] + for line in md_lines: + if "Permalink" in line: + if curr_anchor is not None: + md_sections[curr_anchor] = "\n".join(curr_section) + curr_section = [] + curr_anchor = line.split('"Permalink')[0].split("#")[-1].strip() + else: + curr_section.append(line) + md_sections[curr_anchor] = "\n".join(curr_section) + return md_sections + + +def split_section_into_chunks(text): + document = Document(page_content=text) + chunks = SPLITTER.split_documents([document]) + return chunks + + +def split_page_into_chunks(page_md): + md_sections = split_at_anchors(page_md) + chunks = {} + for anchor, section in md_sections.items(): + chunks[anchor] = split_section_into_chunks(section) + return chunks + + +def get_markdown_documents(filepath): + page_md = get_page_markdown(filepath) + chunks = split_page_into_chunks(page_md) + return chunks diff --git a/links/synthetic_example_generator.py b/links/synthetic_example_generator.py new file mode 100644 index 0000000..20c595e --- /dev/null +++ b/links/synthetic_example_generator.py @@ -0,0 +1,234 @@ +from copy import copy +import fiftyone as fo +import random +import re + + +def 
get_date_fields(dataset): + fields = dataset.get_field_schema(flat=True) + return [ + field_name + for field_name, field in fields.items() + if type(field) == fo.core.fields.DateField + ] + + +def get_datetime_fields(dataset): + fields = dataset.get_field_schema(flat=True) + return [ + field_name + for field_name, field in fields.items() + if type(field) == fo.core.fields.DateTimeField + ] + + +def get_string_fields(dataset): + fields = dataset.get_field_schema(flat=True) + return [ + field_name + for field_name, field in fields.items() + if type(field) == fo.core.fields.StringField + and "." not in field_name + and field_name != "filepath" + ] + + +def get_label_fields(dataset): + det_fields = [] + classification_fields = [] + classifications_fields = [] + sample = dataset.first() + for field_name, field in sample.iter_fields(): + if type(field) == fo.core.labels.Detections: + det_fields.append(field_name) + elif type(field) == fo.core.labels.Classification: + classification_fields.append(field_name) + elif type(field) == fo.core.labels.Classifications: + classifications_fields.append(field_name) + return det_fields, classification_fields, classifications_fields + + +def get_dataset_field_types(dataset): + string_fields = get_string_fields(dataset) + date_fields = get_date_fields(dataset) + datetime_fields = get_datetime_fields(dataset) + ( + det_fields, + classification_fields, + classifications_fields, + ) = get_label_fields(dataset) + + return { + "string_fields": string_fields, + "date_fields": date_fields, + "datetime_fields": datetime_fields, + "det_fields": det_fields, + "classification_fields": classification_fields, + "classifications_fields": classifications_fields, + } + + +BASE_FIELD_PATTERNS = [ + { + "query": "Exclude the {FIELD} field from all samples", + "stages": "[exclude_fields('{FIELD}')]", + }, + { + "query": "Only show samples with field {FIELD}", + "stages": "[exists('{FIELD}')]", + }, + { + "query": "Just show field {FIELD}", + "stages": 
"[select_fields('{FIELD}')]", + }, +] + +STRING_PATTERNS = [ + { + "query": "exclude samples with {FIELD} in {VALUE_LIST}", + "stages": "[exclude_by('{FIELD}', {VALUE_LIST})]", + }, + { + "query": "Images where {FIELD} is {VALUE}", + "stages": "[match(F('{FIELD}') == '{VALUE}')]", + }, + { + "query": "Only images that have {FIELD} not equal to {VALUE}", + "stages": "[match(F('{FIELD}') != '{VALUE}')]", + }, + { + "query": "Show me all the images where {FIELD} ends with s", + "stages": "[match(F('{FIELD}').ends_with('s'))]", + }, + { + "query": " images where {FIELD} is {VALUE1} or {VALUE2}", + "stages": "[match(F('{FIELD}').is_in(['{VALUE1}', '{VALUE2}']))]", + }, +] + +DATE_PATTERNS = [ + { + "query": "Show me all the images with {FIELD} before {VALUE}", + "stages": "[match(F('{FIELD}') < {VALUE})]", + }, + { + "query": "Show me all the images with {FIELD} after {VALUE}", + "stages": "[match(F('{FIELD}') > {VALUE})]", + }, + { + "query": "Samples where {FIELD} is in February", + "stages": "[match(F('{FIELD}').month() == 2)]", + }, + { + "query": "Samples that have 1988 for {FIELD}", + "stages": "[match(F('{FIELD}').year() == 1988)]", + }, + { + "query": "Any images with {FIELD} in the last 10 years", + "stages": "[match(F('{FIELD}').year() > 2013)]", + }, + { + "query": "All of the samples with {FIELD} first five days of the month", + "stages": "[match(F('{FIELD}').day_of_month() < 6)]", + }, +] + +DATETIME_PATTERNS = DATE_PATTERNS + [ + { + "query": "Images where the minute for {FIELD} equals 30", + "stages": "[match(F('{FIELD}').minute() == 30)]", + }, + { + "query": "Images with field {FIELD} after 6pm", + "stages": "[match(F('{FIELD}').hour() > 18)]", + }, + { + "query": "Display the samples where {FIELD} has millisecond of 3 or 4", + "stages": "[match(F('{FIELD}').millisecond().is_in([3, 4]))]", + }, +] + + +class FieldExampleGenerator(object): + """Base class for generating synthetic examples.""" + + def __init__(self, dataset, field_name): + self.dataset = dataset
+ self.field_name = field_name + self.field_values = self.get_field_values() + self.filters = { + "geo": False, + "text_sim": False, + "image_sim": False, + "eval": False, + "metadata": False, + "label_types": ["all"], + } + self.patterns = {"FIELD": self.generate_field_value} + self.example_templates = BASE_FIELD_PATTERNS + + def generate_field_value(self): + return self.field_name + + def select_example_template(self): + return copy(random.choice(self.example_templates)) + + def generate_custom_example(self): + template = self.select_example_template() + return self.fill_template(template) + + def get_patterns_to_replace(self, template): + return list(set(re.findall(r"\{.*?\}", template["query"]))) + + def generate_pattern_replacement(self, pattern): + replacement_func = self.patterns[pattern[1:-1]] + return replacement_func() + + def generate_replacements(self, patterns_to_replace): + replacements = {} + for pattern in patterns_to_replace: + replacement = self.generate_pattern_replacement(pattern) + replacements[pattern] = replacement + return replacements + + def fill_template(self, template): + patterns_to_replace = self.get_patterns_to_replace(template) + replacements = self.generate_replacements(patterns_to_replace) + + query = template["query"] + stages = template["stages"] + for pattern, replacement in replacements.items(): + query = query.replace(pattern, replacement) + stages = stages.replace(pattern, replacement) + template["query"] = query + template["stages"] = stages + return template + + def get_field_values(self): + raise NotImplementedError() + + def generate_examples(self, num_examples): + examples = [ + self.generate_custom_example() for _ in range(num_examples) + ] + return examples + + +class StringFieldExampleGenerator(FieldExampleGenerator): + def __init__(self, dataset, field_name): + super().__init__(dataset, field_name) + self.example_templates = BASE_FIELD_PATTERNS + STRING_PATTERNS + self.patterns["VALUE_LIST"] = 
self.generate_value_list + self.patterns["VALUE"] = self.generate_value + self.patterns["VALUE1"] = self.generate_value + self.patterns["VALUE2"] = self.generate_value + + def get_field_values(self): + return self.dataset.distinct(self.field_name) + + def generate_value(self): + return random.choice(self.field_values) + + def generate_value_list(self): + num_values = min(random.randint(2, 6), len(self.field_values)) + return str(random.sample(self.field_values, num_values)) diff --git a/links/utils.py b/links/utils.py index cc4f88d..3b6fe68 100644 --- a/links/utils.py +++ b/links/utils.py @@ -8,6 +8,7 @@ import hashlib import os import re +import tiktoken import threading import queue @@ -45,6 +46,22 @@ def hash_query(query): return hash_object.hexdigest() +def get_tokenizer(): + encoding = tiktoken.encoding_for_model("gpt-3.5-turbo") + + def tokenizer(text): + return encoding.encode(text) + + return tokenizer + + +def count_tokens(text): + tokenizer = get_tokenizer() + tokens = tokenizer(text) + num_tokens = len(tokens) + return num_tokens + + def get_cache(): g = globals() if "_voxelgpt" not in g: diff --git a/links/view_stage_example_selector.py b/links/view_stage_example_selector.py index 589e175..0851415 100644 --- a/links/view_stage_example_selector.py +++ b/links/view_stage_example_selector.py @@ -38,7 +38,6 @@ def get_or_create_embeddings(queries): else: example_embeddings = {} - query_embeddings = [] query_hashes = [] new_hashes = [] new_queries = [] @@ -58,16 +57,13 @@ def get_or_create_embeddings(queries): for key, embedding in zip(new_hashes, new_embeddings): example_embeddings[key] = embedding - for key in query_hashes: - query_embeddings.append(example_embeddings[key]) - if new_queries: print("Saving embeddings to disk...") with open(EXAMPLE_EMBEDDINGS_PATH, "wb") as f: pickle.dump(example_embeddings, f) - return query_embeddings + return example_embeddings def has_geo_field(sample_collection): @@ -158,8 +154,8 @@ def _load_examples(): ) examples["hash"] =
examples["query"].apply(lambda x: hash_query(x)) - with open(EXAMPLE_EMBEDDINGS_PATH, "rb") as f: - embeddings = pickle.load(f) + queries = examples["query"].tolist() + embeddings = get_or_create_embeddings(queries) embeddings = { key: np.array(embeddings[key]) for key in examples["hash"].tolist() diff --git a/prompts/docs_qa_template.txt b/prompts/docs_qa_template.txt index 0bf9066..9058f83 100644 --- a/prompts/docs_qa_template.txt +++ b/prompts/docs_qa_template.txt @@ -1,15 +1,22 @@ You are an educator and developer advocate, and your goal is to help users of the open source computer vision library FiftyOne to understand the FiftyOne library, its query language, and its functionality. -Given the following extracted segments from the FiftyOne library's documentation and a question, create a final answer to the user's question. Do not include the user's question in your answer. ALWAYS include references at the end of your answer formatted as a list similar to this: +Given extracted segments from the FiftyOne library's documentation and a question, your task is to create a final answer to the user's question. Do not include the user's question in your answer. ALWAYS include references at the end of your answer formatted as a list similar to this: Sources: - source 1 - source 2 -Here are the rules: +Here are some more rules: - If you don't know the answer, just say that you don't know. Don't try to make up an answer. - You will be rewarded for including examples, code snippets, and being as helpful as possible. -- When you use a code snippet, make sure to start the code block with ```py so that it is syntax highlighted. +- When you use inline code, surround it with backticks like this: `code`. This should be used for function names, method names, and variable names. +- When you use a code snippet, code block, or any other code, you MUST start the code block with ```py so that it is syntax highlighted, instead of just ```. 
+- Do NOT include links in the body of your answer. Only include links in the sources section. +- In the sources section, you can ONLY include links that start with https://docs.voxel51.com/. +- All parts of your response must be relevant to the question, and must be factually correct. You will be penalized if you mention something in your response that is not relevant to the question. + + +Given the following question and segments from the FiftyOne library's documentation, write a helpful answer: QUESTION: {question} ========= diff --git a/prompts/similarity_query_extractor_prompt.txt b/prompts/similarity_query_extractor_prompt.txt new file mode 100644 index 0000000..5a4c663 --- /dev/null +++ b/prompts/similarity_query_extractor_prompt.txt @@ -0,0 +1,32 @@ +In the computer vision library FiftyOne, a text_similarity run determines how similar each image is to a user-specified input text prompt. + +For example, if the user specifies the prompt "a dog", the run will return the most dog-like images in the dataset. If the user specifies the prompt "a dog in a field", the run will return the most dog-like images in a field. + +However, not all user-input prompts are immediately ready to be used in a text_similarity run. For instance, if a user specifies the prompt "show me images of a rainy day", the "show me" part of the prompt is not relevant to the run. + +Your task is to generate the effective prompt intended by the user to be used in a text_similarity run. + +Here are the rules: +1. You must respond with a prompt that is a substring of the input query, potentially with some words removed, replaced, reordered, or changed in form/tense. +2. You must be as helpful as possible to the user. For example, if the user specifies the prompt "a dog in a field", you should not return "a dog" as the prompt, because that would not be as helpful as possible to the user. +3. You must be as specific as possible to the user.
For example, if the user specifies the prompt "a dog in a field", you should not return "a dog in a field" as the prompt, because that would not be as specific as possible to the user. + +Generate the effective prompt intended by the user to be used in a text_similarity run for the following input queries: + +input: the bluest sky +output: a blue sky + +input: show me the blurriest images +output: blurry + +input: I want to see all of the images that look like a person +output: a person + +input: display the most colorful images +output: colorful + +input: rainy and cloudy samples +output: rainy and cloudy + +input: QUERY +output: \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index f5b9e1c..8d0b9f2 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,3 +2,4 @@ langchain>=0.0.179 openai pandas scipy +tiktoken
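
Note on the template filling added in `links/synthetic_example_generator.py`: the placeholder substitution can be sketched in isolation, without a FiftyOne dataset. The block below is a minimal illustration under stated assumptions, not the code from this diff: `TEMPLATES`, `fill_template` as a free function, `make_example`, and the `weather` field with its single value are hypothetical stand-ins. It finds every `{PLACEHOLDER}` in the query, draws one replacement per placeholder, and applies the same replacement to both the query and the stages string so the pair stays consistent.

```python
import random
import re
from copy import copy

# Hypothetical stand-in for the pattern lists defined in the diff.
TEMPLATES = [
    {
        "query": "Images where {FIELD} is {VALUE}",
        "stages": "[match(F('{FIELD}') == '{VALUE}')]",
    },
]


def fill_template(template, replacers):
    """Substitute each {PLACEHOLDER} found in the query into query and stages."""
    filled = copy(template)  # shallow copy, so the shared template is untouched
    placeholders = set(re.findall(r"\{.*?\}", filled["query"]))
    # Draw each replacement once so query and stages receive the same value.
    values = {p: replacers[p[1:-1]]() for p in placeholders}
    for placeholder, value in values.items():
        filled["query"] = filled["query"].replace(placeholder, value)
        filled["stages"] = filled["stages"].replace(placeholder, value)
    return filled


def make_example(field_name, field_values):
    """Build one synthetic (query, stages) pair for a string field."""
    replacers = {
        "FIELD": lambda: field_name,
        "VALUE": lambda: random.choice(field_values),
    }
    return fill_template(random.choice(TEMPLATES), replacers)


# Assumed field name and value, chosen only for illustration.
example = make_example("weather", ["sunny"])
print(example["query"])
print(example["stages"])
```

Because placeholders are discovered from the `"query"` string only, a template whose stages string is stored under a different key, or whose stages use a placeholder absent from the query, would not be filled correctly by this approach.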