diff --git a/acl_latex.tex b/acl_latex.tex deleted file mode 100644 index 2eba2f1..0000000 --- a/acl_latex.tex +++ /dev/null @@ -1,377 +0,0 @@ -\documentclass[11pt]{article} - -% Change "review" to "final" to generate the final (sometimes called camera-ready) version. -% Change to "preprint" to generate a non-anonymous version with page numbers. -\usepackage[review]{acl} - -% Standard package includes -\usepackage{times} -\usepackage{latexsym} - -% For proper rendering and hyphenation of words containing Latin characters (including in bib files) -\usepackage[T1]{fontenc} -% For Vietnamese characters -% \usepackage[T5]{fontenc} -% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets - -% This assumes your files are encoded as UTF8 -\usepackage[utf8]{inputenc} - -% This is not strictly necessary, and may be commented out, -% but it will improve the layout of the manuscript, -% and will typically save some space. -\usepackage{microtype} - -% This is also not strictly necessary, and may be commented out. -% However, it will improve the aesthetics of text in -% the typewriter font. -\usepackage{inconsolata} - -%Including images in your LaTeX document requires adding -%additional package(s) -\usepackage{graphicx} - -% If the title and author information does not fit in the area allocated, uncomment the following -% -%\setlength\titlebox{} -% -% and set to something 5cm or larger. - -\title{Instructions for *ACL Proceedings} - -% Author information can be set in various styles: -% For several authors from the same institution: -% \author{Author 1 \and ... \and Author n \\ -% Address line \\ ... \\ Address line} -% if the names do not fit well on one line use -% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\ -% For authors from different institutions: -% \author{Author 1 \\ Address line \\ ... \\ Address line -% \And ... \And -% Author n \\ Address line \\ ... \\ Address line} -% To start a separate ``row'' of authors use \AND, as in -% \author{Author 1 \\ Address line \\ ... \\ Address line -% \AND -% Author 2 \\ Address line \\ ... \\ Address line \And -% Author 3 \\ Address line \\ ... \\ Address line} - -\author{First Author \\ - Affiliation / Address line 1 \\ - Affiliation / Address line 2 \\ - Affiliation / Address line 3 \\ - \texttt{email@domain} \\\And - Second Author \\ - Affiliation / Address line 1 \\ - Affiliation / Address line 2 \\ - Affiliation / Address line 3 \\ - \texttt{email@domain} \\} - -%\author{ -% \textbf{First Author\textsuperscript{1}}, -% \textbf{Second Author\textsuperscript{1,2}}, -% \textbf{Third T. Author\textsuperscript{1}}, -% \textbf{Fourth Author\textsuperscript{1}}, -%\\ -% \textbf{Fifth Author\textsuperscript{1,2}}, -% \textbf{Sixth Author\textsuperscript{1}}, -% \textbf{Seventh Author\textsuperscript{1}}, -% \textbf{Eighth Author \textsuperscript{1,2,3,4}}, -%\\ -% \textbf{Ninth Author\textsuperscript{1}}, -% \textbf{Tenth Author\textsuperscript{1}}, -% \textbf{Eleventh E. Author\textsuperscript{1,2,3,4,5}}, -% \textbf{Twelfth Author\textsuperscript{1}}, -%\\ -% \textbf{Thirteenth Author\textsuperscript{3}}, -% \textbf{Fourteenth F. Author\textsuperscript{2,4}}, -% \textbf{Fifteenth Author\textsuperscript{1}}, -% \textbf{Sixteenth Author\textsuperscript{1}}, -%\\ -% \textbf{Seventeenth S. Author\textsuperscript{4,5}}, -% \textbf{Eighteenth Author\textsuperscript{3,4}}, -% \textbf{Nineteenth N. 
Author\textsuperscript{2,5}}, -% \textbf{Twentieth Author\textsuperscript{1}} -%\\ -%\\ -% \textsuperscript{1}Affiliation 1, -% \textsuperscript{2}Affiliation 2, -% \textsuperscript{3}Affiliation 3, -% \textsuperscript{4}Affiliation 4, -% \textsuperscript{5}Affiliation 5 -%\\ -% \small{ -% \textbf{Correspondence:} \href{mailto:email@domain}{email@domain} -% } -%} - -\begin{document} -\maketitle -\begin{abstract} -This document is a supplement to the general instructions for *ACL authors. It contains instructions for using the \LaTeX{} style files for ACL conferences. -The document itself conforms to its own specifications, and is therefore an example of what your manuscript should look like. -These instructions should be used both for papers submitted for review and for final versions of accepted papers. -\end{abstract} - -\section{Introduction} - -These instructions are for authors submitting papers to *ACL conferences using \LaTeX. They are not self-contained. All authors must follow the general instructions for *ACL proceedings,\footnote{\url{http://acl-org.github.io/ACLPUB/formatting.html}} and this document contains additional instructions for the \LaTeX{} style files. - -The templates include the \LaTeX{} source of this document (\texttt{acl\_latex.tex}), -the \LaTeX{} style file used to format it (\texttt{acl.sty}), -an ACL bibliography style (\texttt{acl\_natbib.bst}), -an example bibliography (\texttt{custom.bib}), -and the bibliography for the ACL Anthology (\texttt{anthology.bib}). - -\section{Engines} - -To produce a PDF file, pdf\LaTeX{} is strongly recommended (over original \LaTeX{} plus dvips+ps2pdf or dvipdf). -The style file \texttt{acl.sty} can also be used with -lua\LaTeX{} and -Xe\LaTeX{}, which are especially suitable for text in non-Latin scripts. -The file \texttt{acl\_lualatex.tex} in this repository provides -an example of how to use \texttt{acl.sty} with either -lua\LaTeX{} or -Xe\LaTeX{}. - -\section{Preamble} - -The first line of the file must be -\begin{quote} -\begin{verbatim} -\documentclass[11pt]{article} -\end{verbatim} -\end{quote} - -To load the style file in the review version: -\begin{quote} -\begin{verbatim} -\usepackage[review]{acl} -\end{verbatim} -\end{quote} -For the final version, omit the \verb|review| option: -\begin{quote} -\begin{verbatim} -\usepackage{acl} -\end{verbatim} -\end{quote} - -To use Times Roman, put the following in the preamble: -\begin{quote} -\begin{verbatim} -\usepackage{times} -\end{verbatim} -\end{quote} -(Alternatives like txfonts or newtx are also acceptable.) - -Please see the \LaTeX{} source of this document for comments on other packages that may be useful. - -Set the title and author using \verb|\title| and \verb|\author|. Within the author list, format multiple authors using \verb|\and| and \verb|\And| and \verb|\AND|; please see the \LaTeX{} source for examples. - -By default, the box containing the title and author names is set to the minimum of 5 cm. If you need more space, include the following in the preamble: -\begin{quote} -\begin{verbatim} -\setlength\titlebox{} -\end{verbatim} -\end{quote} -where \verb|| is replaced with a length. Do not set this length smaller than 5 cm. - -\section{Document Body} - -\subsection{Footnotes} - -Footnotes are inserted with the \verb|\footnote| command.\footnote{This is a footnote.} - -\subsection{Tables and figures} - -See Table~\ref{tab:accents} for an example of a table and its caption. 
-\textbf{Do not override the default caption sizes.} - -\begin{table} - \centering - \begin{tabular}{lc} - \hline - \textbf{Command} & \textbf{Output} \\ - \hline - \verb|{\"a}| & {\"a} \\ - \verb|{\^e}| & {\^e} \\ - \verb|{\`i}| & {\`i} \\ - \verb|{\.I}| & {\.I} \\ - \verb|{\o}| & {\o} \\ - \verb|{\'u}| & {\'u} \\ - \verb|{\aa}| & {\aa} \\\hline - \end{tabular} - \begin{tabular}{lc} - \hline - \textbf{Command} & \textbf{Output} \\ - \hline - \verb|{\c c}| & {\c c} \\ - \verb|{\u g}| & {\u g} \\ - \verb|{\l}| & {\l} \\ - \verb|{\~n}| & {\~n} \\ - \verb|{\H o}| & {\H o} \\ - \verb|{\v r}| & {\v r} \\ - \verb|{\ss}| & {\ss} \\ - \hline - \end{tabular} - \caption{Example commands for accented characters, to be used in, \emph{e.g.}, Bib\TeX{} entries.} - \label{tab:accents} -\end{table} - -As much as possible, fonts in figures should conform -to the document fonts. See Figure~\ref{fig:experiments} for an example of a figure and its caption. - -Using the \verb|graphicx| package graphics files can be included within figure -environment at an appropriate point within the text. -The \verb|graphicx| package supports various optional arguments to control the -appearance of the figure. -You must include it explicitly in the \LaTeX{} preamble (after the -\verb|\documentclass| declaration and before \verb|\begin{document}|) using -\verb|\usepackage{graphicx}|. - -\begin{figure}[t] - \includegraphics[width=\columnwidth]{example-image-golden} - \caption{A figure with a caption that runs for more than one line. - Example image is usually available through the \texttt{mwe} package - without even mentioning it in the preamble.} - \label{fig:experiments} -\end{figure} - -\begin{figure*}[t] - \includegraphics[width=0.48\linewidth]{example-image-a} \hfill - \includegraphics[width=0.48\linewidth]{example-image-b} - \caption {A minimal working example to demonstrate how to place - two images side-by-side.} -\end{figure*} - -\subsection{Hyperlinks} - -Users of older versions of \LaTeX{} may encounter the following error during compilation: -\begin{quote} -\verb|\pdfendlink| ended up in different nesting level than \verb|\pdfstartlink|. -\end{quote} -This happens when pdf\LaTeX{} is used and a citation splits across a page boundary. The best way to fix this is to upgrade \LaTeX{} to 2018-12-01 or later. - -\subsection{Citations} - -\begin{table*} - \centering - \begin{tabular}{lll} - \hline - \textbf{Output} & \textbf{natbib command} & \textbf{ACL only command} \\ - \hline - \citep{Gusfield:97} & \verb|\citep| & \\ - \citealp{Gusfield:97} & \verb|\citealp| & \\ - \citet{Gusfield:97} & \verb|\citet| & \\ - \citeyearpar{Gusfield:97} & \verb|\citeyearpar| & \\ - \citeposs{Gusfield:97} & & \verb|\citeposs| \\ - \hline - \end{tabular} - \caption{\label{citation-guide} - Citation commands supported by the style file. - The style is based on the natbib package and supports all natbib citation commands. - It also supports commands defined in previous ACL style files for compatibility. - } -\end{table*} - -Table~\ref{citation-guide} shows the syntax supported by the style files. -We encourage you to use the natbib styles. -You can use the command \verb|\citet| (cite in text) to get ``author (year)'' citations, like this citation to a paper by \citet{Gusfield:97}. -You can use the command \verb|\citep| (cite in parentheses) to get ``(author, year)'' citations \citep{Gusfield:97}. 
-You can use the command \verb|\citealp| (alternative cite without parentheses) to get ``author, year'' citations, which is useful for using citations within parentheses (e.g. \citealp{Gusfield:97}). - -A possessive citation can be made with the command \verb|\citeposs|. -This is not a standard natbib command, so it is generally not compatible -with other style files. - -\subsection{References} - -\nocite{Ando2005,andrew2007scalable,rasooli-tetrault-2015} - -The \LaTeX{} and Bib\TeX{} style files provided roughly follow the American Psychological Association format. -If your own bib file is named \texttt{custom.bib}, then placing the following before any appendices in your \LaTeX{} file will generate the references section for you: -\begin{quote} -\begin{verbatim} -\bibliography{custom} -\end{verbatim} -\end{quote} - -You can obtain the complete ACL Anthology as a Bib\TeX{} file from \url{https://aclweb.org/anthology/anthology.bib.gz}. -To include both the Anthology and your own .bib file, use the following instead of the above. -\begin{quote} -\begin{verbatim} -\bibliography{anthology,custom} -\end{verbatim} -\end{quote} - -Please see Section~\ref{sec:bibtex} for information on preparing Bib\TeX{} files. - -\subsection{Equations} - -An example equation is shown below: -\begin{equation} - \label{eq:example} - A = \pi r^2 -\end{equation} - -Labels for equation numbers, sections, subsections, figures and tables -are all defined with the \verb|\label{label}| command and cross references -to them are made with the \verb|\ref{label}| command. - -This an example cross-reference to Equation~\ref{eq:example}. - -\subsection{Appendices} - -Use \verb|\appendix| before any appendix section to switch the section numbering over to letters. See Appendix~\ref{sec:appendix} for an example. - -\section{Bib\TeX{} Files} -\label{sec:bibtex} - -Unicode cannot be used in Bib\TeX{} entries, and some ways of typing special characters can disrupt Bib\TeX's alphabetization. The recommended way of typing special characters is shown in Table~\ref{tab:accents}. - -Please ensure that Bib\TeX{} records contain DOIs or URLs when possible, and for all the ACL materials that you reference. -Use the \verb|doi| field for DOIs and the \verb|url| field for URLs. -If a Bib\TeX{} entry has a URL or DOI field, the paper title in the references section will appear as a hyperlink to the paper, using the hyperref \LaTeX{} package. - -\section*{Limitations} - -This document does not cover the content requirements for ACL or any -other specific venue. Check the author instructions for -information on -maximum page lengths, the required ``Limitations'' section, -and so on. - -\section*{Acknowledgments} - -This document has been adapted -by Steven Bethard, Ryan Cotterell and Rui Yan -from the instructions for earlier ACL and NAACL proceedings, including those for -ACL 2019 by Douwe Kiela and Ivan Vuli\'{c}, -NAACL 2019 by Stephanie Lukin and Alla Roskovskaya, -ACL 2018 by Shay Cohen, Kevin Gimpel, and Wei Lu, -NAACL 2018 by Margaret Mitchell and Stephanie Lukin, -Bib\TeX{} suggestions for (NA)ACL 2017/2018 from Jason Eisner, -ACL 2017 by Dan Gildea and Min-Yen Kan, -NAACL 2017 by Margaret Mitchell, -ACL 2012 by Maggie Li and Michael White, -ACL 2010 by Jing-Shin Chang and Philipp Koehn, -ACL 2008 by Johanna D. 
Moore, Simone Teufel, James Allan, and Sadaoki Furui,
-ACL 2005 by Hwee Tou Ng and Kemal Oflazer,
-ACL 2002 by Eugene Charniak and Dekang Lin,
-and earlier ACL and EACL formats written by several people, including
-John Chen, Henry S. Thompson and Donald Walker.
-Additional elements were taken from the formatting instructions of the \emph{International Joint Conference on Artificial Intelligence} and the \emph{Conference on Computer Vision and Pattern Recognition}.
-
-% Bibliography entries for the entire Anthology, followed by custom entries
-%\bibliography{custom,anthology-overleaf-1,anthology-overleaf-2}
-
-% Custom bibliography entries only
-\bibliography{custom}
-
-\appendix
-
-\section{Example Appendix}
-\label{sec:appendix}
-
-This is an appendix.
-
-\end{document}
diff --git a/alc_latex.tex b/alc_latex.tex
new file mode 100644
index 0000000..0ef0030
--- /dev/null
+++ b/alc_latex.tex
@@ -0,0 +1,252 @@
+\documentclass[11pt]{article}
+
+% Change "review" to "final" to generate the camera-ready version.
+\usepackage[review]{acl}
+\usepackage{times}
+\usepackage{latexsym}
+\usepackage[T1]{fontenc}
+\usepackage[utf8]{inputenc}
+\usepackage{microtype}
+\usepackage{inconsolata}
+\usepackage{graphicx}
+
+% Title and author are placeholders to be finalised before submission.
+\title{Attention Mechanisms and Interpretability in Natural Language Inference}
+
+\author{First Author \\
+  Affiliation / Address line 1 \\
+  \texttt{email@domain} \\}
+
+\begin{document}
+\maketitle
+\begin{abstract}
+This study investigates how attention mechanisms influence natural language inference (NLI) performance across three architectures: LSTM with Cross-Attention, BiLSTM with Self-Attention, and Transformer encoders.
+Each model was trained under identical hyperparameter settings to ensure fair comparison and reproducibility.
+Extensive experiments, including ablation studies on attention mechanisms, transformer depth, and embedding dimensions, reveal nuanced trade-offs between interpretability and accuracy.
+While BiLSTM Self-Attention achieved the highest test accuracy among attention-enhanced models (71.64\%), the BiLSTM without attention performed slightly better (72.25\%), suggesting that attention improves interpretability rather than raw performance.
+Transformer models demonstrated sensitivity to architecture depth and embedding size, reflecting overfitting under limited data.
+Overall, the findings highlight that attention mechanisms enhance transparency in reasoning while requiring careful tuning to maintain generalisation.
+\end{abstract}
+
+\section{Introduction}
+
+Natural Language Inference (NLI) is a foundational problem in Natural Language Processing (NLP) that requires models to determine whether a hypothesis logically follows from a given premise. Although recent advances in deep learning have improved NLI accuracy, challenges remain in explaining why a model arrives at a specific decision. This limitation has led to growing interest in attention-based mechanisms that provide both predictive performance and interpretability.
+
+This project examines the role of attention architectures in enhancing interpretability and reasoning in NLI models. Specifically, it compares three architectures (LSTM Cross-Attention, BiLSTM Self-Attention, and Transformer) to investigate how different attention mechanisms influence model behaviour, contextual understanding, and generalisation in low-data scientific settings. Each model was trained under identical hyperparameter configurations, including batch size, optimiser, learning rate, and dropout, ensuring that any observed performance variation stemmed solely from architectural differences.
+
+Through this comparison, the study explores how Cross-Attention captures surface-level token relationships, how Self-Attention introduces intra-sequence contextual learning, and how Transformer attention models global dependencies.
Both quantitative metrics and qualitative visualisations (attention heatmaps) were employed to assess accuracy, attention intensity, and semantic focus across models.
+
+The findings demonstrate that while LSTM Cross-Attention achieved competitive accuracy (70.65\%), its attention remained lexically driven and semantically shallow. In contrast, BiLSTM Self-Attention and Transformer models exhibited more meaningful attention distributions, highlighting a trade-off between performance and interpretability. This work emphasises that interpretability and accuracy are not always correlated and that understanding model focus patterns is essential for building transparent and trustworthy NLP systems.
+
+\section{Related Work}
+
+Attention mechanisms have become central to improving both performance and interpretability in natural language processing. Early approaches by \citet{bahdanau2014neural} introduced the concept of attention in neural machine translation, allowing models to selectively focus on relevant parts of input sequences. This innovation addressed the limitations of traditional recurrent networks that struggled with long-range dependencies.
+
+Subsequent research expanded the role of attention beyond translation. \citet{lin2017structured} proposed a Self-Attention mechanism within a BiLSTM architecture, enabling models to capture intra-sentence relationships and generate interpretable structured sentence representations. Self-Attention not only improved classification accuracy but also provided insights into which tokens contributed most to model predictions.
+
+The paradigm shifted significantly with the introduction of the Transformer architecture by \citet{vaswani2017attention}, which replaced recurrence entirely with multi-head Self-Attention layers. This design allowed global context modelling and parallel computation, leading to state-of-the-art performance across NLP tasks.
+
+Building on this progression, the present study compares Cross-Attention, Self-Attention, and Transformer architectures under controlled conditions to investigate how attention structure influences reasoning and interpretability in NLI.
+
+\section{Methodology}
+
+This study evaluates three neural architectures (LSTM with Cross-Attention, BiLSTM with Self-Attention, and a Transformer encoder) on science-domain natural language inference (binary: \textsc{Entails} vs.\ \textsc{Neutral}). All models were trained under identical optimisation settings to ensure a fair comparison; only the architectural component under study was varied in ablations.
+
+\subsection{Dataset and Preprocessing}
+The dataset comprises premise–hypothesis pairs labelled as \textsc{Entails} (1) or \textsc{Neutral} (0). We used the provided splits: \textbf{23{,}088} train (87\%), \textbf{1{,}304} validation (4.9\%), and \textbf{2{,}126} test (8.1\%). Text was tokenised via a lightweight regex tokenizer (\verb|re.findall(r'\b\w+\b', text)|), lowercased, truncated/padded to \textbf{128} tokens, and mapped to a vocabulary of \textbf{11{,}338} types. An illustrative sketch of this preprocessing and of the shared training loop is given in Appendix~\ref{sec:sketch}.
+
+\subsection{Model Architectures}
+\textbf{LSTM + Cross-Attention:} Two-layer bidirectional LSTM (\textbf{hidden}=256) encodes premise/hypothesis; a cross-attention module aligns hypothesis tokens to premise states to form an alignment-aware sentence representation.
+\textbf{BiLSTM + Self-Attention:} Two-layer bidirectional LSTM (\textbf{hidden}=256) followed by multi-head self-attention (\textbf{4 heads}) to weight salient tokens before max/mean pooling.
+\textbf{Transformer Encoder:} A lightweight encoder with \textbf{2 layers}, \textbf{8 heads} per layer, and position-wise feed-forward blocks; premise and hypothesis are encoded and combined for classification. + +\subsection{Training Configuration} +We used Adam (\(\beta_1{=}0.9,\ \beta_2{=}0.999\)), \textbf{learning rate} \(=1\times10^{-4}\), \textbf{dropout} \(=0.1\), small batch size (kept constant across models), and \textbf{early stopping} with \textbf{patience}=3 on validation loss for up to \textbf{8} epochs. To maintain numerical stability, we applied \textbf{gradient clipping} (max-norm \(=1.0\)) and skipped the rare batch whose loss evaluated to NaN. Training ran on an NVIDIA Tesla T4 GPU. + +\subsection{Evaluation Protocol} +We report test \textbf{accuracy} and perform controlled ablations on (i) attention usage, (ii) Transformer depth, and (iii) embedding dimension. We additionally provide \textbf{qualitative} analyses via attention heatmaps to examine where models focus when predicting. + + + +\section{Results} + +\subsection{Main Quantitative Results} +Table~\ref{tab:main} reports test accuracy for the three architectures under the shared training setup. The BiLSTM with Self-Attention attains the highest accuracy, closely followed by the LSTM with Cross-Attention; the Transformer trails slightly under identical hyperparameters. + +\begin{table}[h] +\centering +\begin{tabular}{l c} +\hline +\textbf{Model} & \textbf{Accuracy (\%)} \\ +\hline +LSTM + Cross-Attention & 70.65 \\ +BiLSTM + Self-Attention & \textbf{71.64} \\ +Transformer (2 layers, 8 heads) & 68.96 \\ +\hline +\end{tabular} +\caption{Overall test accuracy on the NLI task (binary: \textsc{Entails}/\textsc{Neutral}).} +\label{tab:main} +\end{table} + +\noindent +\textbf{Observation.} The small gap among models suggests that \emph{attention design} rather than model size alone drives differences in this low-data regime. Note that Transformers typically require tuned optimisation and larger corpora; in preliminary sensitivity checks we observed accuracy improved when \emph{increasing} learning rate and weight decay (not reported as a full sweep to preserve fairness). + +\subsection{Ablation Studies} + +\paragraph{A1: Does attention help?} +We toggled attention within the BiLSTM encoder to isolate its effect. + +\begin{table}[h] +\centering +\begin{tabular}{l c c} +\hline +\textbf{Variant} & \textbf{With Attn} & \textbf{No Attn} \\ +\hline +BiLSTM & 71.64 & \textbf{72.25} \\ +\hline +\end{tabular} +\caption{Ablation A1: Self-attention \emph{reduced} accuracy by 0.61 pp in this setting.} +\label{tab:ablation-attn} +\end{table} + +\noindent +\textbf{Insight.} Attention slightly \emph{hurt} accuracy (\(-0.61\) pp), despite improving \emph{interpretability}. This “attention paradox” indicates that the learned weighting can over-emphasise spurious cues in small, specialised corpora. + +\paragraph{A2: How deep should the Transformer be?} +We varied the encoder depth while holding all else fixed. 
+
+\begin{table}[h]
+\centering
+\begin{tabular}{l c c c}
+\hline
+\textbf{Layers} & \textbf{2} & \textbf{4} & \textbf{6} \\
+\hline
+Accuracy (\%) & 68.96 & \textbf{69.90} & 68.49 \\
+\hline
+\end{tabular}
+\caption{Ablation A2: 4 layers $>$ 2 layers $>$ 6 layers; deeper models overfit.}
+\label{tab:ablation-depth}
+\end{table}
+
+\noindent
+\textbf{Insight.} Depth beyond 4 layers degraded performance, consistent with \emph{overfitting} rather than “more stable convergence.”
+
+\paragraph{A3: What is the right embedding size?}
+We swept embedding dimensionality, keeping architecture and optimisation unchanged.
+
+\begin{table}[h]
+\centering
+\begin{tabular}{l c c c}
+\hline
+\textbf{Dim} & \textbf{64} & \textbf{128} & \textbf{256} \\
+\hline
+Accuracy (\%) & \textbf{71.87} & 71.78 & 68.96 \\
+\hline
+\end{tabular}
+\caption{Ablation A3: Compact embeddings (64–128) outperform larger ones (256).}
+\label{tab:ablation-embed}
+\end{table}
+
+\noindent
+\textbf{Insight.} Smaller embeddings generalise better; the 256-dimensional setting adds capacity that the dataset cannot support.
+
+\subsection{Qualitative Analysis}
+We inspected attention heatmaps on held-out examples to understand model focus patterns.
+
+\paragraph{Case: Sample 7 (error analysis).}
+LSTM+Cross-Attention assigned near-maximal weight to \emph{``petals''}, a lexically salient but \emph{semantically uninformative} token for the entailment decision, and predicted \textsc{Neutral} (incorrect; gold \textsc{Entails}). This aligns with its narrower, lower-contrast heatmaps and indicates reliance on surface overlap.
+
+\paragraph{Case: Sample 70 (error analysis).}
+The LSTM again mispredicted (pred.\ \textsc{Entails}) while highlighting tokens that did not support the hypothesis. In contrast, BiLSTM+Self-Attention distributed weight over relational and connective terms, and the Transformer spread attention more globally, reflecting better contextual integration (even when its final accuracy was lower overall).
+
+\paragraph{Takeaway.}
+Across samples, BiLSTM+Self-Attention and the Transformer produced \emph{more interpretable} focus, emphasising negations, causal markers, and key relational nouns, whereas LSTM+Cross-Attention frequently fixated on isolated content words. This explains why the BiLSTM performed best overall, and why attention \emph{helped understanding} even when ablation A1 showed a small accuracy drop.
+
+\section{Discussion}
+
+The experiments highlight a clear trade-off between quantitative accuracy and qualitative interpretability.
+Among the attention-enhanced architectures, the BiLSTM with Self-Attention achieved the highest test accuracy (71.64\%), outperforming the LSTM with Cross-Attention (70.65\%) by 0.99~percentage points.
+However, the BiLSTM without attention achieved an even higher score of 72.25\%, indicating that attention mechanisms slightly reduced numerical performance while improving interpretability.
+This suggests that, in low-resource NLI tasks, attention may not always boost accuracy but can enhance transparency in model reasoning.
+
+Transformer layer depth showed non-monotonic effects: performance improved from two layers (68.96\%) to four layers (69.90\%), but degraded with six layers (68.49\%), reflecting overfitting under limited data.
+Embedding-size ablations confirmed that compact representations (64--128 dimensions) generalised best, with 64-dimensional embeddings achieving the highest accuracy (71.87\%) overall.
These results suggest that smaller, regularised architectures are better suited to low-resource NLI settings, where large models tend to overfit.
+
+Qualitative analyses further revealed distinct reasoning behaviours across architectures.
+LSTM Cross-Attention relied on shallow lexical overlap (e.g., attending to \textit{petals}) and often misclassified examples.
+BiLSTM Self-Attention captured compositional sentence relations and made correct semantic predictions, while Transformer self-attention identified meaningful contextual cues, such as focusing on the token \textit{separate} (score = 0.975) and correctly identifying the relational structure.
+Overall, attention mechanisms improved interpretability but did not consistently increase accuracy, suggesting that practitioners must balance transparency and performance based on task requirements.
+
+\section{Conclusion}
+
+This study compared three neural architectures (LSTM with Cross-Attention, BiLSTM with Self-Attention, and Transformer encoders) to evaluate how attention structures influence reasoning and performance in low-resource NLI tasks.
+BiLSTM Self-Attention provided superior interpretability through compositional semantic understanding at competitive accuracy (71.64\%), though the BiLSTM without attention achieved marginally higher raw performance (72.25\%).
+Ablation studies revealed that attention mechanisms can reduce accuracy while improving transparency, underscoring the importance of evaluating models beyond numerical metrics.
+Transformer results further highlighted the need to carefully tune depth and embedding size to prevent overfitting.
+Future work should investigate whether pre-trained embeddings or external knowledge bases can mitigate attention's overfitting tendency while preserving interpretability gains.
+
+\section{Experimental Reflections}
+
+Several practical insights emerged during experimentation.
+All models were trained with identical hyperparameters to ensure fair comparison,
+though preliminary experiments indicated that Transformers could achieve higher accuracy
+with architecture-specific tuning (higher learning rate and weight decay).
+We prioritised controlled comparison over maximum performance to isolate architectural effects.
+Early stopping consistently improved generalisation, while occasional NaN batches were skipped
+to maintain numerical stability. The small batch size (16) enhanced regularisation despite introducing slight training variance.
+Attention heatmaps revealed that LSTM Cross-Attention occasionally focused on lexically salient
+but semantically irrelevant tokens, whereas BiLSTM Self-Attention exhibited more compositional reasoning.
+These reflections highlight the importance of combining quantitative evaluation with qualitative inspection
+to achieve both robustness and interpretability in neural NLI models.
+
+\bibliographystyle{acl_natbib}
+\bibliography{references}
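+
+\appendix
+
+\section{Illustrative Preprocessing and Training Sketch}
+\label{sec:sketch}
+
+The following sketch illustrates the tokenisation of the Dataset and Preprocessing subsection and the shared training configuration (Adam with \(\beta_1{=}0.9,\ \beta_2{=}0.999\), learning rate \(1\times10^{-4}\), gradient clipping at max-norm 1.0, NaN-batch skipping, and early stopping with patience~3). It is a minimal PyTorch-style sketch rather than the exact training scripts: the framework and the helper names (\texttt{tokenize}, \texttt{encode}, \texttt{train}, \texttt{evaluate}) are assumptions for illustration, and the model and data loaders are assumed to be constructed elsewhere.
+
+\begin{verbatim}
+import re
+import torch
+import torch.nn as nn
+
+MAX_LEN = 128      # truncation/padding length
+PAD, UNK = 0, 1
+
+def tokenize(text):
+    # lightweight regex tokenizer, lowercased
+    return re.findall(r'\b\w+\b', text.lower())
+
+def encode(tokens, vocab):
+    ids = [vocab.get(t, UNK) for t in tokens]
+    ids = ids[:MAX_LEN]
+    return ids + [PAD] * (MAX_LEN - len(ids))
+
+def evaluate(model, loader, loss_fn, device):
+    model.eval()
+    total, n = 0.0, 0
+    with torch.no_grad():
+        for prem, hyp, label in loader:
+            logits = model(prem.to(device),
+                           hyp.to(device))
+            total += loss_fn(logits,
+                             label.to(device)).item()
+            n += 1
+    return total / max(n, 1)
+
+def train(model, train_loader, val_loader,
+          device, epochs=8, patience=3):
+    opt = torch.optim.Adam(model.parameters(),
+                           lr=1e-4,
+                           betas=(0.9, 0.999))
+    loss_fn = nn.CrossEntropyLoss()
+    best_val, bad = float('inf'), 0
+    for _ in range(epochs):
+        model.train()
+        for prem, hyp, label in train_loader:
+            opt.zero_grad()
+            logits = model(prem.to(device),
+                           hyp.to(device))
+            loss = loss_fn(logits, label.to(device))
+            if torch.isnan(loss):
+                continue   # skip rare NaN batches
+            loss.backward()
+            nn.utils.clip_grad_norm_(
+                model.parameters(), max_norm=1.0)
+            opt.step()
+        val_loss = evaluate(model, val_loader,
+                            loss_fn, device)
+        if val_loss < best_val:
+            best_val, bad = val_loss, 0
+        else:
+            bad += 1
+            if bad >= patience:
+                break      # early stopping
+    return model
+\end{verbatim}
+
+The dropout rate of 0.1 and the batch size of 16 are set when constructing the model and the data loaders, which are omitted here for brevity.
+
+\end{document}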