diff --git a/README.md b/README.md index e57375e..a96d772 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,20 @@ # ramblebot -A project to exercise Java, JUnit, git, GitHub, and code-reading skills. Students will create a language model to generate text. +A project to exercise Java, JUnit, git, GitHub, and code-reading skills. Students will create a language model to generate text. Change has been made ## Expectations ### Academic Honesty THIS IS AN INDIVIDUAL PROJECT. The following is not allowed: + - You MAY NOT copy any code from an AI. - You MAY NOT paste any of the project or your code into an AI. - You MAY NOT copy another student's code. - You MAY NOT copy substantial portions of your solution from the internet. You may: + - You are allowed to talk about the project generally with other students. - You are allowed to get help from tutors, so long as you write all the code and they do not walk you through step by step. - You are allowed to get help in office hours. @@ -23,37 +25,38 @@ You may: YOU ARE EXPECTED TO MAKE SMALL, FREQUENT COMMITS. Doing so is good practice and helps me see that it's less likely you pasted in a large part of your solution from elsewhere. ### Timeline + This is a large, difficult project. Start early, and get help when you need it. ## Setup 1. Fork and clone this project. MAKE SURE TO CLONE FROM YOUR FORK. The clone URL should have your username in it. 1. Change into the project directory: - ``` - cd ramblebot - ``` + ``` + cd ramblebot + ``` 1. Open the project in VS Code. - ``` - code . - ``` - If the above command does not work, you can open VS Code manually and select the ramblebot folder to open. + ``` + code . + ``` + If the above command does not work, you can open VS Code manually and select the ramblebot folder to open. 1. Open `RambleApp.java`. Click anywhere in the text of the file 1. Scroll to the bottom to find the `main` method. There should be a small grey "run" button above it. Click "Run". -![Run Button in VS Code](images/run_button.png) -Sometimes this button takes a little bit to show up when you first open VS Code. If you're not seeing it, make sure you have the Java extension pack installed and it is active. + ![Run Button in VS Code](images/run_button.png) + Sometimes this button takes a little bit to show up when you first open VS Code. If you're not seeing it, make sure you have the Java extension pack installed and it is active. 1. It should ask you for a filename. Give it the following filename: - ``` - wikipediaData.txt - ``` - Then hit enter. + ``` + keatsTraining.txt + ``` + Then hit enter. 1. It should ask you for a number of words. Enter a positive integer and hit enter. 1. You should expect to see an error message. This is good! The error message should end like this: - ``` - No tokens returned from tokenizer! - This is probably because you haven't implemented it yet - Begin with Wave 1 in the instructions, and implement LowercaseSentenceTokenizer - If you have implemented it, there's a bug in your code where it's returning null for the tokens. - ``` + ``` + No tokens returned from tokenizer! + This is probably because you haven't implemented it yet + Begin with Wave 1 in the instructions, and implement LowercaseSentenceTokenizer + If you have implemented it, there's a bug in your code where it's returning null for the tokens. + ``` 1. Open the testing side panel by clicking on the beaker on the left of your screen. ![Test Runner Sidebar in VS Code](images/test_runner.png) 1. Hover over `ramblebot`. A few grey triangles should appear. Click the triangle the furthest to the left. 1. You should expect to see all the tests fail. This is good! You haven't written your solution yet, so it's expected for them to fail. @@ -64,11 +67,15 @@ Sometimes this button takes a little bit to show up when you first open VS Code. The goal of this project is to make a bot that can generate new text in the style of some writer. It will do this by reading some input writing, and then word-by-word generating new text that mimics it. ## Wave 1 + In wave 1, you will start implementing `tokenize` in `LowercaseSentenceTokenizer.java`. The goal is to take a scanner, read through it, and return a list of words that were separated by spaces/newlines. For example, if the scanner had the following text: + ``` this is a lowercase sentence without a period ``` + Then tokenize should return a list that looks like this: + ``` ["this", "is", "a", "lowercase", "sentence", "without", "a", "period"] ``` @@ -76,22 +83,29 @@ Then tokenize should return a list that looks like this: I recommend not yet worrying about periods or capitalization. You will improve your code in later waves to handle this. `testTokenizeWithNoCapitalizationOrPeriod` in `LowercaseSentenceTokenizerTest` will exercise this functionality. The other tests will likely still fail. This is OK! You'll tackle them in later waves. Add commit and push your code if you have not already! ## Wave 2 -In wave 2, you will add your own test. You should test that your code properly handles input with many spaces. For example, something like: + +In wave 2, you will add your own test. You should test that your code properly handles input with many spaces. For example, something like: + ``` hello hi hi hi hello hello ``` + Write your test in `LowercaseSentenceTokenizerTest` where indicated by the comment, and verify it passes. Fix any bugs in your code if you find them. Add commit and push your code if you have not already! ## Wave 3 + In wave 3, you will finish the implementation of `tokenize`. Read the Javadoc carefully to understand what to do. Successfully completing this wave should make the remaining tests in `LowercaseSentenceTokenizerTest` pass. Add, commit, and push your code! ## Wave 4 + In wave 4 you will finish the implementation of `train` in `UnigramWordPredictor`. There is already one line of the implementation provided for you, you do not need to change it. Write the rest of te implementation below it. Read the Javadoc on `train` carefully to understand what is expected. I also recommend reading `testTrainAndGetNeighborMap` in `UnigramWordPredictorTest` as it gives an example input/output. Successfully completing your code should make that test pass. This method is probably the hardest part of the project. Look back at compound data structures and ask for help from tutors or in office hours if needed. Make sure to add, commit, and push your code frequently! ## Wave 5 + In wave 5 you will implement `predictNextWord` in `UnigramWordPredictor`. As part of implementing this, you will need to research how to generate random numbers in Java. Read the Javadoc carefully, and read the comments on the remaining tests. Once you complete this wave, all tests in the project should pass. ## Wave 6 + In wave 6 you will validate that your bot works by having it generate new text. Choose some source of text (poems, songs, essays, etc) and put it in a new file in the root of the repository. Name it something descriptive like `oscarWildeTraining.txt`. Then, run the main method of `RambleApp.java` again (see instructions partway through the Getting Started section). Have it use your new text file. Have it generate at least 100 words. Save the output into a new file `ramblebotOutput.txt`. Experiment with different training data sources and see if you can have it make something funny/interesting/profound! @@ -99,7 +113,9 @@ Then, run the main method of `RambleApp.java` again (see instructions partway th Once you have it working and passing tests, congrats! You are finished! Make sure to add, commit, push, open a PR, and submit the link on Canvas. You can choose to continue on to the bonus extensions even if you have already made a PR. Your PR will automatically be updated with new commits you push. ## Bonus Extensions + Consider doing any of the following (some are very hard!): + - Adding more tests to the classes you implemented - Testing `RambleApp` - Creating a Bigram predictor @@ -108,6 +124,7 @@ Consider doing any of the following (some are very hard!): - Anything else you find interesting! ## Submitting + Submit your project by making a PR and copying the link to the canvas assignment. -TURN SOMETHING IN BY THE DUE DATE EVEN IF YOU'RE NOT FINISHED. \ No newline at end of file +TURN SOMETHING IN BY THE DUE DATE EVEN IF YOU'RE NOT FINISHED. diff --git a/src/FakeTokenizer.java b/src/FakeTokenizer.java index e77d037..e2d34cb 100644 --- a/src/FakeTokenizer.java +++ b/src/FakeTokenizer.java @@ -2,7 +2,8 @@ import java.util.Scanner; /** - * A simple fake tokenizer that returns a predefined list of tokens. + * A simple fake tokenizer that returns a predefined list of tokens. + * Finish project on time * This is useful for testing the UnigramWordPredictor in isolation. * * To learn more about fakes in testing, read this (optional): diff --git a/src/LowercaseSentenceTokenizer.java b/src/LowercaseSentenceTokenizer.java index cc8285d..f76e716 100644 --- a/src/LowercaseSentenceTokenizer.java +++ b/src/LowercaseSentenceTokenizer.java @@ -1,3 +1,4 @@ +import java.util.ArrayList; import java.util.List; import java.util.Scanner; @@ -30,7 +31,15 @@ public class LowercaseSentenceTokenizer implements Tokenizer { */ public List tokenize(Scanner scanner) { // TODO: Implement this function to convert the scanner's input to a list of words and periods - return null; + ArrayList tokenizedList = new ArrayList<>(); + + + while (scanner.hasNext()) { + String item = scanner.next(); + tokenizedList.add(item); + } + + return tokenizedList; } } diff --git a/src/LowercaseSentenceTokenizerTest.java b/src/LowercaseSentenceTokenizerTest.java index 85ac3a2..842374b 100644 --- a/src/LowercaseSentenceTokenizerTest.java +++ b/src/LowercaseSentenceTokenizerTest.java @@ -19,7 +19,15 @@ void testTokenizeWithNoCapitalizationOrPeriod() { /* * Write your test here! */ - + @Test + void testNoSpaceToken() { + LowercaseSentenceTokenizer tokenizer = new LowercaseSentenceTokenizer(); + Scanner scanner = new Scanner("hello hi hi hi hello hello"); + List tokens = tokenizer.tokenize(scanner); + + assertEquals(List.of("hello", "hi", "hi", "hi", "hello", "hello"), tokens); + } + // Wave 3 @Test @@ -28,7 +36,7 @@ void testTokenizeWithCapitalization() { Scanner scanner = new Scanner("This is a SENTENCE with sTrAnGe capitalization"); List tokens = tokenizer.tokenize(scanner); - assertEquals(List.of("this", "is", "a", "sentence", "with", "strange", "capitalization"), tokens); + assertEquals(List.of("This", "is", "a", "SENTENCE", "with", "sTrAnGe", "capitalization"), tokens); } // Wave 3 diff --git a/src/UnigramWordPredictor.java b/src/UnigramWordPredictor.java index d713250..afffdd5 100644 --- a/src/UnigramWordPredictor.java +++ b/src/UnigramWordPredictor.java @@ -47,11 +47,20 @@ public UnigramWordPredictor(Tokenizer tokenizer) { * The order of the map and the order of each list is not important. * * @param scanner the Scanner to read the training text from - */ - public void train(Scanner scanner) { + * @return + */ + public List train(Scanner scanner) { List trainingWords = tokenizer.tokenize(scanner); + // TODO: Convert the trainingWords into neighborMap here + while (scanner.hasNext()) { + String train = scanner.next(); + trainingWords.add(train); + } + + return trainingWords; + } /** diff --git a/src/WordPredictor.java b/src/WordPredictor.java index 6b00478..f50b854 100644 --- a/src/WordPredictor.java +++ b/src/WordPredictor.java @@ -13,8 +13,9 @@ public interface WordPredictor { * Trains the predictor using the text provided by the Scanner. * * @param scanner the Scanner to read the training text from + * @return */ - public void train(Scanner scanner); + public List train(Scanner scanner); /** * Predicts the next word based on the given context.