[WIP] Fix the transformers' error and update the score_logra and score_TRAK #214
DanielNi868 wants to merge 31 commits into TRAIS-Lab:main from
Conversation
| --proj_dim 256 \
| --proj_max_batch_size 8 \
| --proj_type random_mask
| --output_dir ../checkpoints \
| --block_size 512 \
| --seed ${SEED}
| trust_remote_code=args.trust_remote_code,
| attn_implementation="eager", # Use eager attention for better performance
| )
| model = model.cuda()
No need to keep this troubleshooting information, since it is no longer an issue.

Just remove the troubleshooting message; there is no need to tell the user how the toolkit developer resolved the problem.
| config=config,
| low_cpu_mem_usage=args.low_cpu_mem_usage,
| trust_remote_code=args.trust_remote_code,
| attn_implementation="eager", # Use eager attention for better performance
The only thing needed in this PR is to add this line in score_TRAK and score_logra.

If there is anything else you need to change in score_logra and score_TRAK, please comment on why it is needed in order to fix the transformers error regarding vmap. Otherwise, we may keep them unchanged.
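For context, a minimal sketch of the change this thread is about, assuming the GPT-2 causal-LM setup used in this experiment (the model name and surrounding arguments are illustrative; only the attn_implementation line is the actual change):

```python
# Minimal sketch, not the exact score_TRAK/score_logra code: load the model with
# eager attention so that the per-sample gradient transforms (torch.func.vmap)
# do not hit the transformers error this PR is trying to fix.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                       # illustrative; the experiment loads from ../checkpoints
    attn_implementation="eager",  # the one-line change under discussion
)
model = model.cuda()
model.eval()
```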
I deleted those files and updated readme.md.
| trust_remote_code=args.trust_remote_code,
| attn_implementation="eager", # Use eager attention for better performance
| )
| model = model.cuda()
Just remove the troubleshooting message; there is no need to tell the user how the toolkit developer resolved the problem.
| config=config,
| low_cpu_mem_usage=args.low_cpu_mem_usage,
| trust_remote_code=args.trust_remote_code,
| attn_implementation="eager", # Use eager attention for better performance
If there is anything else you need to change in score_logra and score_TRAK, please comment on why it is needed in order to fix the transformers error regarding vmap. Otherwise, we may keep them unchanged.
I updated readme.md in TRAK. The function f in main should keep the added unsqueeze(0) to avoid the dimension mismatch error in def f(params, batch).
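A hedged sketch of the kind of per-sample loss function being described (the names and the functional_call wiring are illustrative, not the exact code in main): torch.func.vmap strips the batch dimension before calling f, so each tensor arrives with shape (seq_len,) and needs unsqueeze(0) before it is passed to the Hugging Face model.

```python
import torch
from torch.func import functional_call

def f(params, batch):
    # Sketch only: under torch.func.vmap each sample loses its batch dimension,
    # so input_ids / attention_mask / labels arrive as 1-D tensors of shape (seq_len,).
    input_ids, attention_mask, labels = batch
    # Re-add the batch dimension removed by vmap, as in the diff hunks below.
    input_ids = input_ids.unsqueeze(0).cuda()
    attention_mask = attention_mask.unsqueeze(0).cuda()
    labels = labels.unsqueeze(0).cuda()
    output = functional_call(
        model,  # assumed: the causal LM loaded in the enclosing scope
        params,
        args=(),
        kwargs={
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        },
    )
    return output.loss
```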
| try:
|     from transformers.utils import send_example_telemetry
| except ImportError:
|     send_example_telemetry = None # Not available in newer transformers versions
I get ImportError: cannot import name 'send_example_telemetry' from 'transformers.utils' when using the original code.
| #fix the import error in newer transformers versions
| if send_example_telemetry is not None:
|     send_example_telemetry("run_clm_no_trainer", args)
I cannot import name 'send_example_telemetry' from 'transformers.utils'.
| default="random_mask",
| choices=["normal", "rademacher", "random_mask", "sjlt", "grass"],
| help="Random projection type used for TRAK/TracIn (default: random_mask).",
| )
I got torch.OutOfMemoryError: CUDA out of memory when I did not have these three parameters.
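For reference, a sketch of how these three arguments could be declared (the flag names and choices come from the diff; the defaults mirror the example invocation quoted at the top of this page and are otherwise illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
# Exposing these as CLI flags lets the user shrink the random projection
# (smaller proj_dim / proj_max_batch_size) to avoid the CUDA OOM reported here.
parser.add_argument("--proj_dim", type=int, default=256)
parser.add_argument("--proj_max_batch_size", type=int, default=8)
parser.add_argument(
    "--proj_type",
    type=str,
    default="random_mask",
    choices=["normal", "rademacher", "random_mask", "sjlt", "grass"],
    help="Random projection type used for TRAK/TracIn (default: random_mask).",
)
args = parser.parse_args()
```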
| input_ids = input_ids.unsqueeze(0).cuda()
| attention_mask = attention_mask.unsqueeze(0).cuda()
| labels = labels.unsqueeze(0).cuda()
I got an IndexError: too many indices for tensor of dimension 2.
| # Re-add batch dimension removed by vmap
| input_ids = input_ids.unsqueeze(0).cuda()
| attention_mask = attention_mask.unsqueeze(0).cuda()
| labels = labels.unsqueeze(0).cuda()
I got an IndexError: too many indices for tensor of dimension 2.
| input_ids = input_ids.unsqueeze(0).cuda()
| attention_mask = attention_mask.unsqueeze(0).cuda()
| labels = labels.unsqueeze(0).cuda()
I got an IndexError: too many indices for tensor of dimension 2.
| if len(parts) == 2 and parts[1].isdigit():
|     num_checkpoints = int(parts[1])
|     requested_checkpoints = int(parts[1])
| else:
I ran this again and it works this time; I think this modification can be deleted.
| checkpoints = [str(p) for p in available_checkpoint_dirs[:requested_checkpoints]]
| elif method in ["TracIn", "Grad-Dot", "Grad-Cos"]:
I ran this again and it works this time; I think this modification can be deleted.
| method,
| )
| checkpoints = [str(p) for p in available_checkpoint_dirs[:requested_checkpoints]]
| else:
I ran this again and it works this time; I think this modification can be deleted.
| "proj_dim": 2048,
| "proj_dim": args.proj_dim,
| "proj_max_batch_size": args.proj_max_batch_size,
| "proj_type": args.proj_type,
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacity of 44.35 GiB of which 41.56 GiB is free. Including non-PyTorch memory, this process has 2.79 GiB memory in use. Of the allocated memory 2.36 GiB is allocated by PyTorch, and 114.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
| "proj_dim": 2048,
| "proj_dim": args.proj_dim,
| "proj_max_batch_size": args.proj_max_batch_size,
| "proj_type": args.proj_type,
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacity of 44.35 GiB of which 41.56 GiB is free. Including non-PyTorch memory, this process has 2.79 GiB memory in use. Of the allocated memory 2.36 GiB is allocated by PyTorch, and 114.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
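A hedged sketch of how these settings might be wired into the projector configuration (the keys come from the diff; the surrounding code is illustrative, not the exact score_TRAK/score_logra code). The OOM above was hit with the hardcoded proj_dim of 2048; passing the CLI values instead keeps the projection small enough for a single GPU.

```python
# Sketch only: keys mirror the diff above; "projector_kwargs" as a variable name
# and the args object are assumptions about the surrounding script.
projector_kwargs = {
    "proj_dim": args.proj_dim,                         # e.g. 256 instead of the hardcoded 2048
    "proj_max_batch_size": args.proj_max_batch_size,   # e.g. 8
    "proj_type": args.proj_type,                       # e.g. "random_mask"
}
```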
| else:
|     raise e
| new_model.eval()
I think this modification can be deleted.
|     from transformers.utils import send_example_telemetry
| except ImportError:
|     send_example_telemetry = None # Not available in newer transformers versions
Same import error as in TRAK.

Just make requirements.txt pin transformers==4.46.0.
| #fix the import error in newer transformers versions
| if send_example_telemetry is not None:
|     send_example_telemetry("run_clm_no_trainer", args)
Same import error as in TRAK.
| model_id = -1
| model_id = 0 # Use checkpoint 0 (final checkpoint)
| checkpoint = f"{args.output_dir}/{model_id}"
FileNotFoundError: Checkpoint directory not found: /dattri/experiments/gpt2_wikitext/checkpoints/-1. Please ensure the checkpoint exists at this path.
| else:
|     raise e
| model.eval()
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '../checkpoints/-1'. Use repo_type argument if needed.
@DanielNi868 please don't paste error messages into the PR.
experiments/gpt2_wikitext/readme.md (Outdated)
| ```bash
| return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
| ```
| The troubleshooting can be avoided by setting the attn_implementation parameter to 'eager' in the from_pretrained function.
Just delete the troubleshooting section.
| from transformers.pytorch_utils import Conv1D
| from dattri.task import AttributionTask
| model_id = -1
Here we need a fully trained model at id = -1.
| checkpoint = f"{args.output_dir}/{model_id}"
| def checkpoints_load_func(model, checkpoint):
|     model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()
What error message did you meet for this function at lines 596-597?
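For what it is worth, a hedged sketch of why this line can fail and a guard that matches the errors pasted earlier (the path handling and the return value are illustrative, not the exact fix in this PR): if the checkpoint directory does not exist locally, from_pretrained falls back to treating the string as a Hub repo id, and a path such as '../checkpoints/-1' then fails repo-id validation with the HFValidationError shown above. Checking the directory first surfaces a clearer FileNotFoundError instead.

```python
from pathlib import Path

from transformers import AutoModelForCausalLM


def checkpoints_load_func(model, checkpoint):
    # Sketch only: guard against a missing local checkpoint directory so the
    # failure is a clear FileNotFoundError rather than huggingface_hub's
    # HFValidationError (raised when a non-existent local path is interpreted
    # as a Hub repo id).
    ckpt_dir = Path(checkpoint)
    if not ckpt_dir.is_dir():
        raise FileNotFoundError(
            f"Checkpoint directory not found: {ckpt_dir}. "
            "Please ensure the checkpoint exists at this path."
        )
    model = AutoModelForCausalLM.from_pretrained(ckpt_dir).cuda()
    model.eval()
    return model
```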
…nd unused ssh_config_template.txt
032c7ab to aa51bd0 (Compare)
Fix the transformers error (setting attn_implementation='eager')
Modify checkpoints_load_func for score_logra.py and score_TRAK.py to fix Hugging Face error
Update readme and comment on the vmap error for transformers