danulr05 committed on
Commit e0f9bee · verified · 1 Parent(s): 01a8222

Update app.py

Files changed (1)
  1. app.py +2126 -37
app.py CHANGED
@@ -33,6 +33,1924 @@ import re
33
  import requests
34
  import json
35
 
36
  app = Flask(__name__)
37
  CORS(app)
38
 
@@ -68,12 +1986,150 @@ llm = ChatGoogleGenerativeAI(
68
  # Initialize translator
69
  translator = Translator()
70
 
71
  def detect_sinhala_content(text: str) -> bool:
72
  """Detect if text contains Sinhala characters"""
73
  # Sinhala Unicode range: U+0D80 to U+0DFF
74
  sinhala_pattern = re.compile(r'[\u0D80-\u0DFF]')
75
  return bool(sinhala_pattern.search(text))
76
 
77
  def detect_singlish(text: str) -> bool:
78
  """Detect common Singlish patterns and words"""
79
  singlish_words = [
@@ -92,17 +2148,28 @@ def detect_singlish(text: str) -> bool:
92
  # Consider it Singlish if it has 2 or more Singlish words
93
  return singlish_word_count >= 2
94
 
95
- def transliterate_singlish_to_sinhala(text: str) -> str:
96
- """Convert Romanized Sinhala (Singlish) to Sinhala script using Swabhasha API"""
97
  try:
98
- # Swabhasha transliterator API endpoint
99
- # Note: You may need to replace this with the actual Swabhasha API endpoint
100
- # For now, using a mock implementation that could be replaced with actual service
 
101
 
102
- # Alternative: Use a local transliteration library or service
103
- # For demo purposes, we'll use a simplified mapping approach
104
- # In production, integrate with actual Swabhasha transliterator
105
 
106
  # Common Singlish to Sinhala mappings (simplified)
107
  singlish_to_sinhala_map = {
108
  'mokadda': 'මොකද්ද',
@@ -151,10 +2218,11 @@ def transliterate_singlish_to_sinhala(text: str) -> str:
151
  else:
152
  transliterated_words.append(word) # Keep original if no mapping
153
 
 
154
  return ' '.join(transliterated_words)
155
 
156
  except Exception as e:
157
- logger.error(f"Transliteration error: {e}")
158
  return text # Return original text if transliteration fails
159
 
160
  def translate_text(text: str, target_language: str = 'en') -> str:
@@ -168,51 +2236,67 @@ def translate_text(text: str, target_language: str = 'en') -> str:
168
 
169
  def process_multilingual_input(user_message: str) -> tuple:
170
  """
171
- Process multilingual input using the improved pipeline:
172
- Romanized Sinhala -> Swabhasha Transliterator -> Sinhala -> Google Translate -> English
173
  """
174
- original_language = 'en'
175
- needs_translation = False
176
  processed_message = user_message
177
  transliteration_used = False
178
 
179
- # Check if input contains Sinhala characters
180
- if detect_sinhala_content(user_message):
181
- logger.info("Detected Sinhala script input")
 
182
  original_language = 'si'
183
  needs_translation = True
184
  processed_message = translate_text(user_message, 'en')
185
  logger.info(f"Translated from Sinhala: '{user_message}' -> '{processed_message}'")
186
-
187
- # Check if input is likely Singlish (Romanized Sinhala)
188
- elif detect_singlish(user_message):
189
- logger.info("Detected Romanized Sinhala (Singlish) input")
190
  original_language = 'singlish'
191
  needs_translation = True
192
  transliteration_used = True
 
193
 
194
  try:
195
- # Step 1: Transliterate Romanized Sinhala to Sinhala script
196
- sinhala_text = transliterate_singlish_to_sinhala(user_message)
197
- logger.info(f"Transliterated Singlish to Sinhala: '{user_message}' -> '{sinhala_text}'")
198
 
199
  # Step 2: Translate Sinhala to English for search
200
  processed_message = translate_text(sinhala_text, 'en')
201
- logger.info(f"Translated Sinhala to English: '{sinhala_text}' -> '{processed_message}'")
202
 
203
  except Exception as e:
204
- logger.error(f"Error in Singlish processing pipeline: {e}")
205
  # Fallback: try direct translation or keep original
206
  try:
207
  processed_message = translate_text(user_message, 'en')
208
- logger.info(f"Fallback translation from Singlish: '{user_message}' -> '{processed_message}'")
209
  except:
210
  processed_message = user_message
211
  needs_translation = False
212
  transliteration_used = False
213
- logger.info("Using original Singlish text for search")
214
 
215
- return processed_message, original_language, needs_translation, transliteration_used
216
 
217
  def translate_response_if_needed(response: str, original_language: str) -> str:
218
  """Translate response back to original language if needed"""
@@ -389,8 +2473,8 @@ def generate_response_with_rag(user_message: str, session_id: str) -> Dict[str,
389
  """Generate response using RAG with memory and multilingual support"""
390
  try:
391
  # Process multilingual input
392
- processed_message, original_language, needs_translation, transliteration_used = process_multilingual_input(user_message)
393
- logger.info(f"Input processing: original='{user_message}', processed='{processed_message}', lang='{original_language}', transliteration='{transliteration_used}'")
394
 
395
  # Get or create memory for this session
396
  memory = get_or_create_memory(session_id)
@@ -444,9 +2528,6 @@ Guidelines:
444
  - Maintain conversation continuity
445
  - Be culturally sensitive when discussing Sri Lankan policies
446
  - When responding in Sinhala, use appropriate formal language for policy discussions
447
- - DO NOT use asterisks (*) for formatting or emphasis
448
- - DO NOT use markdown formatting like **bold** or *italic*
449
- - Use plain text without any special formatting characters
450
 
451
  Please provide a helpful response:"""
452
 
@@ -478,7 +2559,9 @@ Please provide a helpful response:"""
478
  "sources": sources,
479
  "language_detected": original_language,
480
  "translation_used": needs_translation,
481
- "transliteration_used": transliteration_used
 
 
482
  }
483
 
484
  except Exception as e:
@@ -500,7 +2583,9 @@ Please provide a helpful response:"""
500
  "sources": [],
501
  "language_detected": original_language if 'original_language' in locals() else 'en',
502
  "translation_used": False,
503
- "transliteration_used": False
 
 
504
  }
505
 
506
  def clear_session_memory(session_id: str) -> bool:
@@ -542,7 +2627,9 @@ def chat():
542
  "user_message": user_message,
543
  "language_detected": result.get("language_detected", "en"),
544
  "translation_used": result.get("translation_used", False),
545
- "transliteration_used": result.get("transliteration_used", False)
 
 
546
  })
547
 
548
  except Exception as e:
@@ -740,7 +2827,7 @@ def detect_language():
740
  "error": "Text is required"
741
  }), 400
742
 
743
- processed_message, original_language, needs_translation, transliteration_used = process_multilingual_input(text)
744
 
745
  return jsonify({
746
  "original_text": text,
@@ -748,6 +2835,8 @@ def detect_language():
748
  "language_detected": original_language,
749
  "translation_needed": needs_translation,
750
  "transliteration_used": transliteration_used,
 
 
751
  "contains_sinhala": detect_sinhala_content(text),
752
  "is_singlish": detect_singlish(text)
753
  })
 
33
  import requests
34
  import json
35
 
36
+ # AI-based language processing imports
37
+ from transformers import pipeline
38
+ import torch
39
+
40
+ app = Flask(__name__)
41
+ CORS(app)
42
+
43
+ # Configure logging
44
+ logging.basicConfig(level=logging.INFO)
45
+ logger = logging.getLogger(__name__)
46
+
47
+ # Configure Gemini
48
+ GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
49
+ if not GEMINI_API_KEY:
50
+ logger.error("GEMINI_API_KEY not found in environment variables")
51
+ raise ValueError("Please set GEMINI_API_KEY in your .env file")
52
+
53
+ # Configure Pinecone
54
+ PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
55
+ if not PINECONE_API_KEY:
56
+ logger.error("PINECONE_API_KEY not found in environment variables")
57
+ raise ValueError("Please set PINECONE_API_KEY in your .env file")
58
+
59
+ # Initialize Pinecone and embedding model
60
+ pc = Pinecone(api_key=PINECONE_API_KEY)
61
+ BUDGET_INDEX_NAME = "budget-proposals-index"
62
+ embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
63
+
64
+ # Initialize LangChain components
65
+ llm = ChatGoogleGenerativeAI(
66
+ model="gemini-2.5-flash",
67
+ google_api_key=GEMINI_API_KEY,
68
+ temperature=0.7,
69
+ max_tokens=1000
70
+ )
71
+
72
+ # Initialize translator
73
+ translator = Translator()
74
+
75
+ # Initialize AI-based language detection and transliteration models
76
+ logger.info("Loading AI models...")
77
+ try:
78
+ # Use Google Translate's language detection which supports Sinhala
79
+ # This is more reliable for Sinhala than the HF model
80
+ language_detector = "google_translate" # Use Google Translate for detection
81
+ logger.info("Using Google Translate for language detection (supports Sinhala)")
82
+ except Exception as e:
83
+ logger.error(f"Failed to initialize language detection: {e}")
84
+ language_detector = None
85
+
86
+ try:
87
+ # Sinhala transliteration model
88
+ sinhala_transliterator = pipeline(
89
+ "text2text-generation",
90
+ model="deshanksuman/swabhashambart50SinhalaTransliteration"
91
+ )
92
+ logger.info("Sinhala transliteration model loaded successfully")
93
+ except Exception as e:
94
+ logger.error(f"Failed to load transliteration model: {e}")
95
+ sinhala_transliterator = None
96
+
97
+ def detect_sinhala_content(text: str) -> bool:
98
+ """Detect if text contains Sinhala characters"""
99
+ # Sinhala Unicode range: U+0D80 to U+0DFF
100
+ sinhala_pattern = re.compile(r'[\u0D80-\u0DFF]')
101
+ return bool(sinhala_pattern.search(text))
102
+
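The Unicode-range test in `detect_sinhala_content` can be checked in isolation. The sketch below is illustrative only: the name `contains_sinhala` and the module-level precompiled pattern are assumptions, not part of app.py. Compiling the pattern once avoids rebuilding it on every call.

```python
import re

# Sinhala block: U+0D80 to U+0DFF. Compiled once at import time (illustrative choice).
SINHALA_RE = re.compile(r'[\u0D80-\u0DFF]')

def contains_sinhala(text: str) -> bool:
    """Return True if any character falls in the Sinhala Unicode block."""
    return bool(SINHALA_RE.search(text))

# Mixed-script input still counts as Sinhala because one matching character is enough.
print(contains_sinhala("මොකද්ද budget eka"))
# Romanized text contains no Sinhala code points, so it needs the Singlish path instead.
print(contains_sinhala("budget eka gana kiyanna"))
```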
103
+ def ai_detect_language(text: str) -> Dict[str, Any]:
104
+ """Enhanced language detection using Google Translate (supports Sinhala)"""
105
+ try:
106
+ if language_detector is None:
107
+ # Fallback to rule-based detection
108
+ return rule_based_language_detection(text)
109
+
110
+ # Check for Sinhala Unicode first (most reliable)
111
+ has_sinhala_unicode = detect_sinhala_content(text)
112
+ if has_sinhala_unicode:
113
+ return {
114
+ 'language': 'si',
115
+ 'confidence': 0.95,
116
+ 'is_sinhala_unicode': True,
117
+ 'is_romanized_sinhala': False,
118
+ 'is_english': False,
119
+ 'detection_method': 'unicode_detection'
120
+ }
121
+
122
+ # Use Google Translate for language detection
123
+ try:
124
+ detection_result = translator.detect(text)
125
+ detected_lang = detection_result.lang
126
+ confidence = detection_result.confidence
127
+
128
+ # Check if it's romanized Sinhala based on content analysis
129
+ is_romanized_sinhala = (
130
+ detected_lang in ['en', 'unknown'] and
131
+ detect_singlish(text)
132
+ )
133
+
134
+ # Override detection if Singlish patterns are strong
135
+ if is_romanized_sinhala:
136
+ detected_lang = 'singlish'
137
+ confidence = max(0.7, confidence) # Boost confidence for Singlish
138
+
139
+ return {
140
+ 'language': detected_lang,
141
+ 'confidence': confidence,
142
+ 'is_sinhala_unicode': False,
143
+ 'is_romanized_sinhala': is_romanized_sinhala,
144
+ 'is_english': detected_lang == 'en' and not is_romanized_sinhala,
145
+ 'detection_method': 'google_translate'
146
+ }
147
+
148
+ except Exception as e:
149
+ logger.error(f"Google Translate detection failed: {e}")
150
+ # Fallback to rule-based with Singlish detection
151
+ return enhanced_rule_based_detection(text)
152
+
153
+ except Exception as e:
154
+ logger.error(f"Language detection failed: {e}")
155
+ return rule_based_language_detection(text)
156
+
157
+ def enhanced_rule_based_detection(text: str) -> Dict[str, Any]:
158
+ """Enhanced rule-based detection with better Singlish recognition"""
159
+ has_sinhala_unicode = detect_sinhala_content(text)
160
+ is_romanized_sinhala = detect_singlish(text) and not has_sinhala_unicode
161
+
162
+ # More sophisticated Singlish detection
163
+ if not has_sinhala_unicode and not is_romanized_sinhala:
164
+ # Check for common Sinhala sentence patterns in English letters
165
+ sinhala_patterns = [
166
+ r'\b(mokadda|kohomada|api|oya|mama)\b',
167
+ r'\b(eka|meka|thiyenne|kiyala)\b',
168
+ r'\b(gana|genna|danna|karanna)\b',
169
+ r'\b(budget|proposal).*\b(gana|eka)\b'
170
+ ]
171
+
172
+ text_lower = text.lower()
173
+ pattern_matches = sum(1 for pattern in sinhala_patterns if re.search(pattern, text_lower))
174
+
175
+ if pattern_matches >= 1: # Lower threshold for better detection
176
+ is_romanized_sinhala = True
177
+
178
+ if has_sinhala_unicode:
179
+ language_code = 'si'
180
+ confidence = 0.9
181
+ elif is_romanized_sinhala:
182
+ language_code = 'singlish'
183
+ confidence = 0.8
184
+ else:
185
+ language_code = 'en'
186
+ confidence = 0.7
187
+
188
+ return {
189
+ 'language': language_code,
190
+ 'confidence': confidence,
191
+ 'is_sinhala_unicode': has_sinhala_unicode,
192
+ 'is_romanized_sinhala': is_romanized_sinhala,
193
+ 'is_english': language_code == 'en',
194
+ 'detection_method': 'enhanced_rule_based'
195
+ }
196
+
197
+ def rule_based_language_detection(text: str) -> Dict[str, Any]:
198
+ """Fallback rule-based language detection"""
199
+ has_sinhala_unicode = detect_sinhala_content(text)
200
+ is_romanized_sinhala = detect_singlish(text) and not has_sinhala_unicode
201
+ is_english = not has_sinhala_unicode and not is_romanized_sinhala
202
+
203
+ if has_sinhala_unicode:
204
+ language_code = 'si'
205
+ elif is_romanized_sinhala:
206
+ language_code = 'singlish'
207
+ else:
208
+ language_code = 'en'
209
+
210
+ return {
211
+ 'language': language_code,
212
+ 'confidence': 0.8, # Default confidence for rule-based
213
+ 'is_sinhala_unicode': has_sinhala_unicode,
214
+ 'is_romanized_sinhala': is_romanized_sinhala,
215
+ 'is_english': is_english,
216
+ 'detection_method': 'rule_based'
217
+ }
218
+
219
+ def detect_singlish(text: str) -> bool:
220
+ """Detect common Singlish patterns and words"""
221
+ singlish_words = [
222
+ 'mokadda', 'kohomada', 'api', 'oya', 'mama', 'eka', 'meka', 'oya', 'dan', 'kiyala',
223
+ 'budget', 'proposal', 'karan', 'karanna', 'gana', 'genna', 'danna', 'ahala', 'denna',
224
+ 'mata', 'ape', 'wage', 'wenas', 'thiyenne', 'kiyanawa', 'balanawa', 'pennanna',
225
+ 'sampura', 'mudal', 'pasal', 'vyaparayak', 'rajaye', 'arthikaya', 'sammandala',
226
+ 'kara', 'karanna', 'giya', 'yanawa', 'enawa', 'gihin', 'awe', 'nane', 'inne',
227
+ 'danna', 'kiyanna', 'balanna', 'ganna', 'denna', 'yanna', 'enna'
228
+ ]
229
+
230
+ # Convert to lowercase and check for common Singlish words
231
+ text_lower = text.lower()
232
+ singlish_word_count = sum(1 for word in singlish_words if word in text_lower)
233
+
234
+ # Consider it Singlish if it has 2 or more Singlish words
235
+ return singlish_word_count >= 2
236
+
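Note that `word in text_lower` is a substring test. Short entries such as 'api' also match inside unrelated English words ('capital'), and 'budget' and 'proposal' are themselves on the list, so an English question containing "budget proposal" already reaches the threshold of 2. A whole-word variant might look like the sketch below. The name `detect_singlish_tokens`, the trimmed word set, and the decision to leave out 'budget' and 'proposal' are assumptions for illustration, not the commit's behaviour.

```python
import re

# Trimmed, illustrative vocabulary; 'budget' and 'proposal' are deliberately left out
# because they are ordinary English words.
SINGLISH_WORDS = {
    'mokadda', 'kohomada', 'api', 'oya', 'mama', 'eka', 'meka', 'kiyala',
    'gana', 'danna', 'karanna', 'thiyenne', 'kiyanna', 'mata', 'ape',
}

def detect_singlish_tokens(text: str, threshold: int = 2) -> bool:
    """Count distinct whole-word matches instead of substring hits."""
    tokens = set(re.findall(r'[a-z]+', text.lower()))
    return len(tokens & SINGLISH_WORDS) >= threshold

print(detect_singlish_tokens("budget eka gana kiyanna"))
print(detect_singlish_tokens("What does the budget proposal say about capital taxes?"))
```

Counting distinct tokens also stops duplicated list entries from being counted twice.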
237
+ def ai_transliterate_singlish_to_sinhala(text: str) -> str:
238
+ """AI-based transliteration from Romanized Sinhala to Sinhala script"""
239
+ try:
240
+ if sinhala_transliterator is None:
241
+ # Fallback to rule-based transliteration
242
+ logger.info("AI transliterator not available, using rule-based fallback")
243
+ return rule_based_transliterate_singlish_to_sinhala(text)
244
+
245
+ # Use AI model for transliteration
246
+ result = sinhala_transliterator(text, max_length=256, num_return_sequences=1)
247
+ transliterated_text = result[0]['generated_text']
248
+
249
+ logger.info(f"AI transliteration: '{text}' -> '{transliterated_text}'")
250
+ return transliterated_text
251
+
252
+ except Exception as e:
253
+ logger.error(f"AI transliteration failed: {e}")
254
+ return rule_based_transliterate_singlish_to_sinhala(text)
255
+
256
+ def rule_based_transliterate_singlish_to_sinhala(text: str) -> str:
257
+ """Fallback rule-based transliteration for Romanized Sinhala"""
258
+ try:
259
+ # Common Singlish to Sinhala mappings (simplified)
260
+ singlish_to_sinhala_map = {
261
+ 'mokadda': 'මොකද්ද',
262
+ 'kohomada': 'කොහොමද',
263
+ 'api': 'අපි',
264
+ 'oya': 'ඔයා',
265
+ 'mama': 'මම',
266
+ 'eka': 'එක',
267
+ 'meka': 'මේක',
268
+ 'dan': 'දැන්',
269
+ 'kiyala': 'කියලා',
270
+ 'gana': 'ගැන',
271
+ 'genna': 'ගන්න',
272
+ 'danna': 'දන්න',
273
+ 'denna': 'දෙන්න',
274
+ 'mata': 'මට',
275
+ 'ape': 'අපේ',
276
+ 'thiyenne': 'තියෙන්නේ',
277
+ 'kiyanawa': 'කියනවා',
278
+ 'balanawa': 'බලනවා',
279
+ 'pennanna': 'පෙන්නන්න',
280
+ 'sampura': 'සම්පූර්ණ',
281
+ 'mudal': 'මුදල්',
282
+ 'pasal': 'පාසල්',
283
+ 'rajaye': 'රජයේ',
284
+ 'arthikaya': 'ආර්ථිකය',
285
+ 'kara': 'කර',
286
+ 'karanna': 'කරන්න',
287
+ 'giya': 'ගිය',
288
+ 'yanawa': 'යනවා',
289
+ 'enawa': 'එනවා',
290
+ 'inne': 'ඉන්නේ',
291
+ 'yanna': 'යන්න',
292
+ 'enna': 'එන්න'
293
+ }
294
+
295
+ # Simple word-by-word replacement
296
+ words = text.lower().split()
297
+ transliterated_words = []
298
+
299
+ for word in words:
300
+ # Remove punctuation for mapping
301
+ clean_word = re.sub(r'[^\w]', '', word)
302
+ if clean_word in singlish_to_sinhala_map:
303
+ transliterated_words.append(singlish_to_sinhala_map[clean_word])
304
+ else:
305
+ transliterated_words.append(word) # Keep original if no mapping
306
+
307
+ logger.info(f"Rule-based transliteration: '{text}' -> '{' '.join(transliterated_words)}'")
308
+ return ' '.join(transliterated_words)
309
+
310
+ except Exception as e:
311
+ logger.error(f"Rule-based transliteration error: {e}")
312
+ return text # Return original text if transliteration fails
313
+
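The word-by-word fallback lowercases the whole input and drops punctuation from any word it maps. If that matters for the downstream translation step, one option is to substitute matched words in place with a regex callback. This is a minimal sketch under assumptions: `transliterate_keep_punctuation` is a hypothetical name and the mapping is a three-entry subset of the table in the diff.

```python
import re

# Small illustrative subset of the mapping in the diff.
SINGLISH_TO_SINHALA = {
    'mokadda': 'මොකද්ද',
    'eka': 'එක',
    'gana': 'ගැන',
}

def transliterate_keep_punctuation(text: str) -> str:
    """Replace mapped words in place; leave punctuation and unmapped words untouched."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        # Look up case-insensitively, but return the original word if there is no mapping.
        return SINGLISH_TO_SINHALA.get(word.lower(), word)
    return re.sub(r'[A-Za-z]+', swap, text)

print(transliterate_keep_punctuation("Budget eka gana mokadda?"))
```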
314
+ def translate_text(text: str, target_language: str = 'en') -> str:
315
+ """Translate text using Google Translate"""
316
+ try:
317
+ result = translator.translate(text, dest=target_language)
318
+ return result.text
319
+ except Exception as e:
320
+ logger.error(f"Translation error: {e}")
321
+ return text # Return original text if translation fails
322
+
323
+ def process_multilingual_input(user_message: str) -> tuple:
324
+ """
325
+ AI-enhanced multilingual input processing:
326
+ AI Language Detection -> AI Transliteration -> Google Translate -> English
327
+ """
328
+ processed_message = user_message
329
+ transliteration_used = False
330
+ ai_detection_used = False
331
+
332
+ # Step 1: AI-based language detection
333
+ language_info = ai_detect_language(user_message)
334
+ original_language = language_info['language']
335
+ confidence = language_info['confidence']
336
+ detection_method = language_info['detection_method']
337
+
338
+ logger.info(f"Language detection: {original_language} (confidence: {confidence:.2f}, method: {detection_method})")
339
+
340
+ # Determine processing based on detected language
341
+ if language_info['is_sinhala_unicode']:
342
+ # Direct Sinhala Unicode -> English translation
343
+ logger.info("Processing Sinhala Unicode input")
344
+ original_language = 'si'
345
+ needs_translation = True
346
+ processed_message = translate_text(user_message, 'en')
347
+ logger.info(f"Translated from Sinhala: '{user_message}' -> '{processed_message}'")
348
+
349
+ elif language_info['is_romanized_sinhala']:
350
+ # Romanized Sinhala -> AI Transliteration -> Translation
351
+ logger.info("Processing Romanized Sinhala (Singlish) input")
352
+ original_language = 'singlish'
353
+ needs_translation = True
354
+ transliteration_used = True
355
+ ai_detection_used = detection_method == 'google_translate'  # ai_detect_language never reports 'ai'
356
+
357
+ try:
358
+ # Step 1: AI-based transliteration
359
+ sinhala_text = ai_transliterate_singlish_to_sinhala(user_message)
360
+ logger.info(f"AI transliterated: '{user_message}' -> '{sinhala_text}'")
361
+
362
+ # Step 2: Translate Sinhala to English for search
363
+ processed_message = translate_text(sinhala_text, 'en')
364
+ logger.info(f"Translated to English: '{sinhala_text}' -> '{processed_message}'")
365
+
366
+ except Exception as e:
367
+ logger.error(f"Error in AI processing pipeline: {e}")
368
+ # Fallback: try direct translation or keep original
369
+ try:
370
+ processed_message = translate_text(user_message, 'en')
371
+ logger.info(f"Fallback translation: '{user_message}' -> '{processed_message}'")
372
+ except Exception:
373
+ processed_message = user_message
374
+ needs_translation = False
375
+ transliteration_used = False
376
+ logger.info("Using original text for search")
377
+
378
+ else:
379
+ # English or other languages
380
+ logger.info("Processing as English input")
381
+ original_language = 'en'
382
+ needs_translation = False
383
+ processed_message = user_message
384
+
385
+ return processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence
386
+
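`process_multilingual_input` now returns six positional values, and every caller has to unpack them in the same order. A named result is one way to keep call sites readable. The sketch below is an assumption-laden illustration: `ProcessedInput` and its field names are hypothetical and simply mirror the six returned values.

```python
from typing import NamedTuple

class ProcessedInput(NamedTuple):
    """Hypothetical container mirroring the six values returned in the diff."""
    processed_message: str
    original_language: str
    needs_translation: bool
    transliteration_used: bool
    ai_detection_used: bool
    confidence: float

# Example value for an English query; the numbers are made up for illustration.
result = ProcessedInput("What is the EPF proposal?", "en", False, False, False, 0.7)

# Fields can be read by name, and the tuple still unpacks positionally.
print(result.original_language, result.confidence)
message, language, *_ = result
```

Because a `NamedTuple` is still a tuple, existing positional unpacking keeps working while new code can use attribute access.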
387
+ def translate_response_if_needed(response: str, original_language: str) -> str:
388
+ """Translate response back to original language if needed"""
389
+ if original_language == 'si':
390
+ # Translate back to Sinhala
391
+ try:
392
+ translated_response = translate_text(response, 'si')
393
+ logger.info(f"Translated response to Sinhala: '{response[:100]}...' -> '{translated_response[:100]}...'")
394
+ return translated_response
395
+ except Exception as e:
396
+ logger.error(f"Error translating response to Sinhala: {e}")
397
+ return response
398
+ elif original_language == 'singlish':
399
+ # For Singlish, we can optionally provide a mixed response
400
+ # For now, keep English response but could enhance later
401
+ return response
402
+
403
+ return response
404
+
405
+ def get_pinecone_index():
406
+ """Get the budget proposals Pinecone index"""
407
+ try:
408
+ return pc.Index(BUDGET_INDEX_NAME)
409
+ except Exception as e:
410
+ logger.error(f"Error accessing Pinecone index: {e}")
411
+ return None
412
+
413
+ def search_budget_proposals(query: str) -> str:
414
+ """Search budget proposals using the semantic search API"""
415
+ try:
416
+ import requests
417
+
418
+ # Use the deployed semantic search API
419
+ response = requests.post(
420
+ "https://danulr05-budget-proposals-search-api.hf.space/api/search",
421
+ json={"query": query, "top_k": 5},
422
+ timeout=10
423
+ )
424
+
425
+ if response.status_code == 200:
426
+ data = response.json()
427
+ results = data.get("results", [])
428
+
429
+ if not results:
430
+ return "No relevant budget proposals found in the database."
431
+
432
+ # Build context from search results
433
+ context_parts = []
434
+ for result in results[:3]: # Limit to top 3 results
435
+ file_path = result.get("file_path", "")
436
+ category = result.get("category", "")
437
+ summary = result.get("summary", "")
438
+ cost = result.get("costLKR", "")
439
+ title = result.get("title", "")
440
+ content = result.get("content", "") # Get the actual content
441
+
442
+ context_parts.append(f"From {file_path} ({category}): {title}")
443
+ if content:
444
+ context_parts.append(f"Content: {content}")
445
+ elif summary:
446
+ context_parts.append(f"Summary: {summary}")
447
+ if cost and cost != "No Costing Available":
448
+ context_parts.append(f"Cost: {cost}")
449
+
450
+ return "\n\n".join(context_parts)
451
+ else:
452
+ return f"Error accessing semantic search API: {response.status_code}"
453
+
454
+ except Exception as e:
455
+ logger.error(f"Error searching budget proposals: {e}")
456
+ return f"Error searching database: {str(e)}"
457
+
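The context-building loop can be exercised without the network by pulling it into a pure function. This is a sketch, not the commit's code: `build_context` and its `limit` parameter are hypothetical names, and the sample hit is made up. The field names (`file_path`, `category`, `title`, `content`, `summary`, `costLKR`) are the ones read in the diff.

```python
from typing import Any, Dict, List

def build_context(results: List[Dict[str, Any]], limit: int = 3) -> str:
    """Format the top `limit` search hits the same way the loop in the diff does."""
    parts: List[str] = []
    for result in results[:limit]:
        parts.append(
            f"From {result.get('file_path', '')} ({result.get('category', '')}): "
            f"{result.get('title', '')}"
        )
        content = result.get("content", "")
        summary = result.get("summary", "")
        cost = result.get("costLKR", "")
        # Prefer the full content; fall back to the summary when content is empty.
        if content:
            parts.append(f"Content: {content}")
        elif summary:
            parts.append(f"Summary: {summary}")
        if cost and cost != "No Costing Available":
            parts.append(f"Cost: {cost}")
    return "\n\n".join(parts)

# Made-up sample hit; real hits come from the search API.
sample = [{"file_path": "EPF.pdf", "category": "Pensions", "title": "EPF reform",
           "summary": "Short summary.", "costLKR": "No Costing Available"}]
print(build_context(sample))
```

The request in the diff asks for `top_k` of 5 but only the first three hits are used, which the `limit` default reflects.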
458
+ # Create the RAG tool
459
+ search_tool = Tool(
460
+ name="search_budget_proposals",
461
+ description="Search for relevant budget proposals in the vector database. Use this when you need specific information about budget proposals, costs, policies, or implementation details.",
462
+ func=search_budget_proposals
463
+ )
464
+
465
+ # Create the prompt template for the agent
466
+ agent_prompt = ChatPromptTemplate.from_messages([
467
+ ("system", """You are a helpful assistant for budget proposals in Sri Lanka. You have access to a vector database containing detailed information about various budget proposals. You can communicate in English, Sinhala, and understand Singlish (Sinhala written in English letters).
468
+
469
+ When a user asks about budget proposals, you should:
470
+ 1. Use the search_budget_proposals tool to find relevant information
471
+ 2. Provide accurate, detailed responses based on the retrieved information
472
+ 3. Always cite the source documents when mentioning specific proposals
473
+ 4. Be professional but approachable in any language
474
+ 5. If the search doesn't return relevant results, acknowledge this and provide general guidance
475
+ 6. Respond in the same language or style as the user's question when possible
476
+
477
+ Guidelines:
478
+ - Always use the search tool for specific questions about budget proposals
479
+ - Include source citations for any mention of proposals, costs, policies, revenue, or implementation
480
+ - Keep responses clear and informative in any language
481
+ - Use a balanced tone - helpful but not overly casual
482
+ - If asked about topics not covered, redirect to relevant topics professionally
483
+ - Be culturally sensitive when discussing Sri Lankan policies and economic matters
484
+ - When responding in Sinhala, use appropriate formal language for policy discussions"""),
485
+ MessagesPlaceholder(variable_name="chat_history"),
486
+ ("human", "{input}"),
487
+ MessagesPlaceholder(variable_name="agent_scratchpad")
488
+ ])
489
+
490
+ # Store conversation memories for different sessions
491
+ conversation_memories: Dict[str, ConversationBufferWindowMemory] = {}
492
+
493
+ def get_or_create_memory(session_id: str) -> ConversationBufferWindowMemory:
494
+ """Get or create a memory instance for a session"""
495
+ if session_id not in conversation_memories:
496
+ # Create new memory with a window of the last 10 exchanges (k counts human/AI pairs)
497
+ conversation_memories[session_id] = ConversationBufferWindowMemory(
498
+ k=10, # Keep the last 10 exchanges (20 messages)
499
+ return_messages=True,
500
+ memory_key="chat_history"
501
+ )
502
+ logger.info(f"Created new memory for session: {session_id}")
503
+
504
+ return conversation_memories[session_id]
505
+
506
+ def create_agent(session_id: str) -> AgentExecutor:
507
+ """Create a LangChain agent with memory and RAG capabilities"""
508
+ memory = get_or_create_memory(session_id)
509
+
510
+ # Create the agent
511
+ agent = create_openai_functions_agent(
512
+ llm=llm,
513
+ tools=[search_tool],
514
+ prompt=agent_prompt
515
+ )
516
+
517
+ # Create agent executor with memory
518
+ agent_executor = AgentExecutor(
519
+ agent=agent,
520
+ tools=[search_tool],
521
+ memory=memory,
522
+ verbose=False,
523
+ handle_parsing_errors=True
524
+ )
525
+
526
+ return agent_executor
527
+
528
+ def get_available_pdfs() -> List[str]:
529
+ """Dynamically get list of available PDF files from assets directory"""
530
+ try:
531
+ import os
532
+ pdf_dir = "assets/pdfs"
533
+ if os.path.exists(pdf_dir):
534
+ pdf_files = [f for f in os.listdir(pdf_dir) if f.lower().endswith('.pdf')]
535
+ return pdf_files
536
+ else:
537
+ # Fallback to known PDFs if directory doesn't exist
538
+ return ['MLB.pdf', 'Cigs.pdf', 'Elec.pdf', 'Audit_EPF.pdf', 'EPF.pdf', 'Discretion.pdf', '1750164001872.pdf']
539
+ except Exception as e:
540
+ logger.error(f"Error getting available PDFs: {e}")
541
+ # Fallback to known PDFs
542
+ return ['MLB.pdf', 'Cigs.pdf', 'Elec.pdf', 'Audit_EPF.pdf', 'EPF.pdf', 'Discretion.pdf', '1750164001872.pdf']
543
+
544
+ def extract_sources_from_response(response: str) -> List[str]:
545
+ """Extract source documents mentioned in the response"""
546
+ sources = []
547
+
548
+ # Get dynamically available PDF files
549
+ available_pdfs = get_available_pdfs()
550
+
551
+ # Look for source patterns like "(Source: MLB.pdf)" or "(Sources: MLB.pdf, EPF.pdf)"
552
+ for pdf in available_pdfs:
553
+ if pdf in response:
554
+ sources.append(pdf)
555
+
556
+ return list(set(sources)) # Remove duplicates
557
+
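Because `pdf in response` is a substring test, a response that cites only 'Audit_EPF.pdf' also reports 'EPF.pdf', and `list(set(...))` returns the sources in no fixed order. A stricter variant could look like the sketch below. `extract_sources` and its explicit `available_pdfs` parameter are hypothetical, introduced only so the example is self-contained.

```python
import re
from typing import List

def extract_sources(response: str, available_pdfs: List[str]) -> List[str]:
    """Return cited PDF names in a stable order, matching whole filenames only."""
    found: List[str] = []
    for pdf in available_pdfs:
        # Reject matches preceded by a filename character, so 'EPF.pdf'
        # is not reported when only 'Audit_EPF.pdf' is cited.
        pattern = r'(?<![\w.-])' + re.escape(pdf)
        if re.search(pattern, response) and pdf not in found:
            found.append(pdf)
    return found

pdfs = ['MLB.pdf', 'Audit_EPF.pdf', 'EPF.pdf']
print(extract_sources("See (Source: Audit_EPF.pdf) and (Source: MLB.pdf).", pdfs))
```

The result follows the order of `available_pdfs`, which keeps the API response deterministic across calls.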
558
+ def generate_response_with_rag(user_message: str, session_id: str) -> Dict[str, Any]:
559
+ """Generate response using RAG with memory and multilingual support"""
560
+ try:
561
+ # Process multilingual input
562
+ processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence = process_multilingual_input(user_message)
563
+ logger.info(f"Input processing: original='{user_message}', processed='{processed_message}', lang='{original_language}', transliteration='{transliteration_used}', ai_detection='{ai_detection_used}', confidence='{confidence:.2f}'")
564
+
565
+ # Get or create memory for this session
566
+ memory = get_or_create_memory(session_id)
567
+
568
+ # Search for relevant context using processed (English) message
569
+ search_context = search_budget_proposals(processed_message)
570
+
571
+ # Get conversation history for context
572
+ chat_history = memory.chat_memory.messages
573
+ conversation_context = ""
574
+ if chat_history:
575
+ # Get last few messages for context
576
+ recent_messages = chat_history[-6:] # Last 3 exchanges
577
+ conversation_parts = []
578
+ for msg in recent_messages:
579
+ if isinstance(msg, HumanMessage):
580
+ conversation_parts.append(f"User: {msg.content}")
581
+ elif isinstance(msg, AIMessage):
582
+ conversation_parts.append(f"Assistant: {msg.content}")
583
+ conversation_context = "\n".join(conversation_parts)
584
+
585
+ # Create a prompt with conversation history and retrieved context
586
+ language_instruction = ""
587
+ if original_language == 'si':
588
+ language_instruction = "\n\nIMPORTANT: The user asked in Sinhala. Please respond in Sinhala using proper Sinhala script and formal language appropriate for policy discussions."
589
+ elif original_language == 'singlish':
590
+ if transliteration_used:
591
+ language_instruction = "\n\nNote: The user used Romanized Sinhala (transliterated via Swabhasha). Please respond in Sinhala using proper Sinhala script and formal language appropriate for policy discussions."
592
+ else:
593
+ language_instruction = "\n\nNote: The user used Singlish (Sinhala words in English letters). You may respond in English but consider using some familiar Sri Lankan terminology where appropriate."
594
+
595
+ prompt = f"""You are a helpful assistant for budget proposals in Sri Lanka. You can communicate in English, Sinhala, and understand Singlish.
596
+
597
+ Based on the following information from the budget proposals database:
598
+
599
+ {search_context}
600
+
601
+ {conversation_context}
602
+
603
+ Current user question: {processed_message}
604
+ Original user input: {user_message}
605
+ {language_instruction}
606
+
607
+ Guidelines:
608
+ - Be professional but approachable in any language
609
+ - Include specific details from the retrieved information
610
+ - Cite the source documents when mentioning specific proposals
611
+ - If the search doesn't return relevant results, acknowledge this and provide general guidance
612
+ - Keep responses clear and informative
613
+ - Reference previous conversation context when relevant
614
+ - Maintain conversation continuity
615
+ - Be culturally sensitive when discussing Sri Lankan policies
616
+ - When responding in Sinhala, use appropriate formal language for policy discussions
617
+
618
+ Please provide a helpful response:"""
619
+
620
+ # Generate response using the LLM directly
621
+ response = llm.invoke(prompt)
622
+ response_text = response.content.strip()
623
+
624
+ # Translate response back if needed
625
+ if needs_translation and (original_language == 'si' or (original_language == 'singlish' and transliteration_used)):
626
+ response_text = translate_response_if_needed(response_text, original_language)
627
+
628
+ # Extract sources from response
629
+ sources = extract_sources_from_response(response_text)
630
+
631
+ # Add messages to memory (store original user message for context)
632
+ memory.chat_memory.add_user_message(user_message)
633
+ memory.chat_memory.add_ai_message(response_text)
634
+
635
+ # Get updated conversation history for context
636
+ chat_history = memory.chat_memory.messages
637
+
638
+ return {
639
+ "response": response_text,
640
+ "confidence": "high",
641
+ "session_id": session_id,
642
+ "conversation_length": len(chat_history),
643
+ "memory_used": True,
644
+ "rag_used": True,
645
+ "sources": sources,
646
+ "language_detected": original_language,
647
+ "translation_used": needs_translation,
648
+ "transliteration_used": transliteration_used,
649
+ "ai_detection_used": ai_detection_used,
650
+ "detection_confidence": confidence
651
+ }
652
+
653
+ except Exception as e:
654
+ logger.error(f"Error generating response with RAG: {e}")
655
+ # Provide error message in appropriate language
656
+ error_message = "I'm sorry, I'm having trouble processing your request right now. Please try again later."
657
+ if locals().get('original_language') == 'si':
658
+ try:
659
+ error_message = translate_text(error_message, 'si')
660
+ except Exception:
661
+ pass # Keep English if translation fails
662
+
663
+ return {
664
+ "response": error_message,
665
+ "confidence": "error",
666
+ "session_id": session_id,
667
+ "memory_used": False,
668
+ "rag_used": False,
669
+ "sources": [],
670
+ "language_detected": original_language if 'original_language' in locals() else 'en',
671
+ "translation_used": False,
672
+ "transliteration_used": False,
673
+ "ai_detection_used": False,
674
+ "detection_confidence": 0.0
675
+ }
676
+
677
+ def clear_session_memory(session_id: str) -> bool:
678
+ """Clear memory for a specific session"""
679
+ try:
680
+ if session_id in conversation_memories:
681
+ del conversation_memories[session_id]
682
+ logger.info(f"Cleared memory for session: {session_id}")
683
+ return True
684
+ return False
685
+ except Exception as e:
686
+ logger.error(f"Error clearing memory: {e}")
687
+ return False
688
+
689
+ @app.route('/api/chat', methods=['POST'])
690
+ def chat():
691
+ """Enhanced chat endpoint with memory"""
692
+ try:
693
+ data = request.get_json(silent=True) or {}
694
+ user_message = data.get('message', '').strip()
695
+ session_id = data.get('session_id', 'default')
696
+
697
+ if not user_message:
698
+ return jsonify({
699
+ "error": "Message is required"
700
+ }), 400
701
+
702
+ # Generate response with memory
703
+ result = generate_response_with_rag(user_message, session_id)
704
+
705
+ return jsonify({
706
+ "response": result["response"],
707
+ "confidence": result["confidence"],
708
+ "session_id": session_id,
709
+ "conversation_length": result.get("conversation_length", 0),
710
+ "memory_used": result.get("memory_used", False),
711
+ "rag_used": result.get("rag_used", False),
712
+ "sources": result.get("sources", []),
713
+ "user_message": user_message,
714
+ "language_detected": result.get("language_detected", "en"),
715
+ "translation_used": result.get("translation_used", False),
716
+ "transliteration_used": result.get("transliteration_used", False),
717
+ "ai_detection_used": result.get("ai_detection_used", False),
718
+ "detection_confidence": result.get("detection_confidence", 0.0)
719
+ })
720
+
721
+ except Exception as e:
722
+ logger.error(f"Chat API error: {e}")
723
+ return jsonify({"error": str(e)}), 500
724
+
725
+ @app.route('/api/chat/clear', methods=['POST'])
726
+ def clear_chat():
727
+ """Clear chat memory for a session"""
728
+ try:
729
+ data = request.get_json(silent=True) or {}
730
+ session_id = data.get('session_id', 'default')
731
+
732
+ success = clear_session_memory(session_id)
733
+
734
+ return jsonify({
735
+ "success": success,
736
+ "session_id": session_id,
737
+ "message": "Chat memory cleared successfully" if success else "Session not found"
738
+ })
739
+
740
+ except Exception as e:
741
+ logger.error(f"Clear chat error: {e}")
742
+ return jsonify({"error": str(e)}), 500
743
+
744
+ @app.route('/api/chat/sessions', methods=['GET'])
745
+ def list_sessions():
746
+ """List all active chat sessions"""
747
+ try:
748
+ sessions = []
749
+ for session_id, memory in conversation_memories.items():
750
+ messages = memory.chat_memory.messages
751
+ sessions.append({
752
+ "session_id": session_id,
753
+ "message_count": len(messages),
754
+ "last_activity": datetime.now().isoformat() # Simplified for now
755
+ })
756
+
757
+ return jsonify({
758
+ "sessions": sessions,
759
+ "total_sessions": len(sessions)
760
+ })
761
+
762
+ except Exception as e:
763
+ logger.error(f"List sessions error: {e}")
764
+ return jsonify({"error": str(e)}), 500
765
+
766
+ @app.route('/api/chat/history/<session_id>', methods=['GET'])
767
+ def get_chat_history(session_id: str):
768
+ """Get chat history for a specific session"""
769
+ try:
770
+ if session_id not in conversation_memories:
771
+ return jsonify({
772
+ "session_id": session_id,
773
+ "history": [],
774
+ "message_count": 0
775
+ })
776
+
777
+ memory = conversation_memories[session_id]
778
+ messages = memory.chat_memory.messages
779
+
780
+ history = []
781
+ for msg in messages:
782
+ if isinstance(msg, HumanMessage):
783
+ history.append({
784
+ "type": "human",
785
+ "content": msg.content,
786
+ "timestamp": datetime.now().isoformat()
787
+ })
788
+ elif isinstance(msg, AIMessage):
789
+ history.append({
790
+ "type": "ai",
791
+ "content": msg.content,
792
+ "timestamp": datetime.now().isoformat()
793
+ })
794
+
795
+ return jsonify({
796
+ "session_id": session_id,
797
+ "history": history,
798
+ "message_count": len(history)
799
+ })
800
+
801
+ except Exception as e:
802
+ logger.error(f"Get chat history error: {e}")
803
+ return jsonify({"error": str(e)}), 500
804
+
805
+ @app.route('/api/chat/health', methods=['GET'])
806
+ def chat_health():
807
+ """Health check for the enhanced chatbot"""
808
+ try:
809
+ # Test LangChain connection and vector database
810
+ test_agent = create_agent("health_check")
811
+ test_response = test_agent.invoke({"input": "Hello"})
812
+
813
+ # Test vector database connection
814
+ pc_index = get_pinecone_index()
815
+ vector_db_status = "connected" if pc_index else "disconnected"
816
+
817
+ return jsonify({
818
+ "status": "healthy",
819
+ "message": "Enhanced budget proposals chatbot with RAG is running",
820
+ "langchain_status": "connected" if test_response else "disconnected",
821
+ "vector_db_status": vector_db_status,
822
+ "rag_enabled": True,
823
+ "active_sessions": len(conversation_memories),
824
+ "memory_enabled": True
825
+ })
826
+ except Exception as e:
827
+ return jsonify({
828
+ "status": "unhealthy",
829
+ "message": f"Error: {str(e)}"
830
+ }), 500
831
+
832
+ @app.route('/api/chat/debug/<session_id>', methods=['GET'])
833
+ def debug_session(session_id: str):
834
+ """Debug endpoint to check session memory"""
835
+ try:
836
+ memory_exists = session_id in conversation_memories
837
+ memory_info = {
838
+ "session_id": session_id,
839
+ "memory_exists": memory_exists,
840
+ "total_sessions": len(conversation_memories),
841
+ "session_keys": list(conversation_memories.keys())
842
+ }
843
+
844
+ if memory_exists:
845
+ memory = conversation_memories[session_id]
846
+ messages = memory.chat_memory.messages
847
+ memory_info.update({
848
+ "message_count": len(messages),
849
+ "messages": [
850
+ {
851
+ "type": getattr(msg, 'type', 'unknown'),
852
+ "content": getattr(msg, 'content', '')[:100] + "..." if len(getattr(msg, 'content', '')) > 100 else getattr(msg, 'content', '')
853
+ }
854
+ for msg in messages
855
+ ]
856
+ })
857
+
858
+ return jsonify(memory_info)
859
+
860
+ except Exception as e:
861
+ logger.error(f"Debug session error: {e}")
862
+ return jsonify({"error": str(e)}), 500
863
+
864
+ @app.route('/api/chat/suggestions', methods=['GET'])
865
+ def get_chat_suggestions():
866
+ """Get suggested questions for the chatbot with multilingual support"""
867
+ suggestions = [
868
+ "What are the maternity leave benefits proposed? 🤱",
869
+ "How do the cigarette tax proposals work? 💰",
870
+ "What changes are proposed for electricity tariffs? ⚡",
871
+ "Tell me about the EPF audit proposals 📊",
872
+ "What tax reforms are being suggested? 🏛️",
873
+ "How will these proposals affect the economy? 📈",
874
+ "What is the cost of implementing these proposals? 💵",
875
+ "Can you compare the costs of different proposals? ⚖️",
876
+ "What are the main benefits of these proposals? ✨",
877
+ "Budget proposals gana kiyanna 📋",
878
+ "EPF eka gana mokadda thiyenne? 💰",
879
+ "Electricity bill eka wenas wenawada? ⚡",
880
+ "Maternity leave benefits kiyannako 🤱",
881
+ "මේ budget proposals වල cost එක කීයද? 💵",
882
+ "රජයේ ආර්ථික ප්‍රතිපත්ති ගැන කියන්න 🏛️"
883
+ ]
884
+
885
+ return jsonify({
886
+ "suggestions": suggestions,
887
+ "supported_languages": ["English", "Sinhala", "Singlish"]
888
+ })
889
+
890
+ @app.route('/api/chat/available-pdfs', methods=['GET'])
891
+ def get_available_pdfs_endpoint():
892
+ """Get list of available PDF files for debugging"""
893
+ try:
894
+ available_pdfs = get_available_pdfs()
895
+ return jsonify({
896
+ "available_pdfs": available_pdfs,
897
+ "count": len(available_pdfs),
898
+ "pdf_directory": "assets/pdfs"
899
+ })
900
+ except Exception as e:
901
+ logger.error(f"Error getting available PDFs: {e}")
902
+ return jsonify({"error": str(e)}), 500
903
+
904
+ @app.route('/api/chat/detect-language', methods=['POST'])
905
+ def detect_language():
906
+ """Test language detection functionality"""
907
+ try:
908
+ data = request.get_json(silent=True) or {}
909
+ text = data.get('text', '').strip()
910
+
911
+ if not text:
912
+ return jsonify({
913
+ "error": "Text is required"
914
+ }), 400
915
+
916
+ processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence = process_multilingual_input(text)
917
+
918
+ return jsonify({
919
+ "original_text": text,
920
+ "processed_text": processed_message,
921
+ "language_detected": original_language,
922
+ "translation_needed": needs_translation,
923
+ "transliteration_used": transliteration_used,
924
+ "ai_detection_used": ai_detection_used,
925
+ "detection_confidence": confidence,
926
+ "contains_sinhala": detect_sinhala_content(text),
927
+ "is_singlish": detect_singlish(text)
928
+ })
929
+
930
+ except Exception as e:
931
+ logger.error(f"Language detection error: {e}")
932
+ return jsonify({"error": str(e)}), 500
933
+
934
+ @app.route('/', methods=['GET'])
935
+ def home():
936
+ """Home endpoint with API documentation"""
937
+ return jsonify({
938
+ "message": "Multilingual Budget Proposals Chatbot API with Swabhasha Pipeline",
939
+ "version": "2.1.0",
940
+ "supported_languages": ["English", "Sinhala", "Romanized Sinhala (Singlish)"],
941
+ "features": ["RAG", "Memory", "Swabhasha Transliteration", "Google Translation", "FAISS Vector Store"],
942
+ "pipeline": "Romanized Sinhala → Swabhasha → Sinhala Script → Google Translate → English → LLM → Response",
943
+ "endpoints": {
944
+ "POST /api/chat": "Chat with memory, RAG, and multilingual support",
945
+ "POST /api/chat/clear": "Clear chat memory",
946
+ "GET /api/chat/sessions": "List active sessions",
947
+ "GET /api/chat/history/<session_id>": "Get chat history",
948
+ "GET /api/chat/health": "Health check",
949
+ "GET /api/chat/suggestions": "Get suggested questions (multilingual)",
950
+ "GET /api/chat/available-pdfs": "Get available PDF files",
951
+ "POST /api/chat/detect-language": "Test language detection"
952
+ },
953
+ "status": "running"
954
+ })
955
+
956
+ if __name__ == '__main__':
957
+ app.run(debug=False, host='0.0.0.0', port=7860)
958
+ #!/usr/bin/env python3
959
+ """
960
+ Enhanced Budget Proposals Chatbot API using LangChain with Memory and Agentic RAG
961
+ """
962
+
963
+ from flask import Flask, request, jsonify
964
+ from flask_cors import CORS
965
+ import os
966
+ import logging
967
+ import json
968
+ from datetime import datetime
969
+ from typing import Dict, List, Any
970
+
971
+ # LangChain imports
972
+ from langchain_google_genai import ChatGoogleGenerativeAI
973
+ from langchain.memory import ConversationBufferWindowMemory
974
+ from langchain.schema import HumanMessage, AIMessage
975
+ from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
976
+ from langchain.chains import LLMChain
977
+ from langchain_community.chat_message_histories import RedisChatMessageHistory
978
+ from langchain.tools import Tool
979
+ from langchain.agents import AgentExecutor, create_openai_functions_agent
980
+ from langchain.agents.openai_functions_agent.base import OpenAIFunctionsAgent
981
+ from langchain.schema import BaseMessage
982
+
983
+ # Vector database imports
984
+ from pinecone import Pinecone
985
+ from sentence_transformers import SentenceTransformer
986
+
987
+ # Language detection and translation imports
988
+ from googletrans import Translator
989
+ import re
990
+ import requests
991
+ import json
992
+
993
+ # AI-based language processing imports
994
+ from transformers import pipeline
995
+ import torch
996
+
997
+ app = Flask(__name__)
998
+ CORS(app)
999
+
1000
+ # Configure logging
1001
+ logging.basicConfig(level=logging.INFO)
1002
+ logger = logging.getLogger(__name__)
1003
+
1004
+ # Configure Gemini
1005
+ GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
1006
+ if not GEMINI_API_KEY:
1007
+ logger.error("GEMINI_API_KEY not found in environment variables")
1008
+ raise ValueError("Please set GEMINI_API_KEY in your .env file")
1009
+
1010
+ # Configure Pinecone
1011
+ PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
1012
+ if not PINECONE_API_KEY:
1013
+ logger.error("PINECONE_API_KEY not found in environment variables")
1014
+ raise ValueError("Please set PINECONE_API_KEY in your .env file")
1015
+
1016
+ # Initialize Pinecone and embedding model
1017
+ pc = Pinecone(api_key=PINECONE_API_KEY)
1018
+ BUDGET_INDEX_NAME = "budget-proposals-index"
1019
+ embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
1020
+
1021
+ # Initialize LangChain components
1022
+ llm = ChatGoogleGenerativeAI(
1023
+ model="gemini-2.5-flash",
1024
+ google_api_key=GEMINI_API_KEY,
1025
+ temperature=0.7,
1026
+ max_tokens=1000
1027
+ )
1028
+
1029
+ # Initialize translator
1030
+ translator = Translator()
1031
+
1032
+ # Initialize AI-based language detection and transliteration models
1033
+ logger.info("Loading AI models...")
1034
+ try:
1035
+ # Use Google Translate's language detection which supports Sinhala
1036
+ # This is more reliable for Sinhala than the HF model
1037
+ language_detector = "google_translate" # Use Google Translate for detection
1038
+ logger.info("Using Google Translate for language detection (supports Sinhala)")
1039
+ except Exception as e:
1040
+ logger.error(f"Failed to initialize language detection: {e}")
1041
+ language_detector = None
1042
+
1043
+ try:
1044
+ # Sinhala transliteration model
1045
+ sinhala_transliterator = pipeline(
1046
+ "text2text-generation",
1047
+ model="deshanksuman/swabhashambart50SinhalaTransliteration"
1048
+ )
1049
+ logger.info("Sinhala transliteration model loaded successfully")
1050
+ except Exception as e:
1051
+ logger.error(f"Failed to load transliteration model: {e}")
1052
+ sinhala_transliterator = None
1053
+
1054
+ def detect_sinhala_content(text: str) -> bool:
1055
+ """Detect if text contains Sinhala characters"""
1056
+ # Sinhala Unicode range: U+0D80 to U+0DFF
1057
+ sinhala_pattern = re.compile(r'[\u0D80-\u0DFF]')
1058
+ return bool(sinhala_pattern.search(text))
1059
+
1060
+ def ai_detect_language(text: str) -> Dict[str, Any]:
1061
+ """Enhanced language detection using Google Translate (supports Sinhala)"""
1062
+ try:
1063
+ if language_detector is None:
1064
+ # Fallback to rule-based detection
1065
+ return rule_based_language_detection(text)
1066
+
1067
+ # Check for Sinhala Unicode first (most reliable)
1068
+ has_sinhala_unicode = detect_sinhala_content(text)
1069
+ if has_sinhala_unicode:
1070
+ return {
1071
+ 'language': 'si',
1072
+ 'confidence': 0.95,
1073
+ 'is_sinhala_unicode': True,
1074
+ 'is_romanized_sinhala': False,
1075
+ 'is_english': False,
1076
+ 'detection_method': 'unicode_detection'
1077
+ }
1078
+
1079
+ # Use Google Translate for language detection
1080
+ try:
1081
+ detection_result = translator.detect(text)
1082
+ detected_lang = detection_result.lang
1083
+ confidence = detection_result.confidence or 0.0 # Guard: confidence may be None, which would break max() below
1084
+
1085
+ # Check if it's romanized Sinhala based on content analysis
1086
+ is_romanized_sinhala = (
1087
+ detected_lang in ['en', 'unknown'] and
1088
+ detect_singlish(text)
1089
+ )
1090
+
1091
+ # Override detection if Singlish patterns are strong
1092
+ if is_romanized_sinhala:
1093
+ detected_lang = 'singlish'
1094
+ confidence = max(0.7, confidence) # Boost confidence for Singlish
1095
+
1096
+ return {
1097
+ 'language': detected_lang,
1098
+ 'confidence': confidence,
1099
+ 'is_sinhala_unicode': False,
1100
+ 'is_romanized_sinhala': is_romanized_sinhala,
1101
+ 'is_english': detected_lang == 'en' and not is_romanized_sinhala,
1102
+ 'detection_method': 'google_translate'
1103
+ }
1104
+
1105
+ except Exception as e:
1106
+ logger.error(f"Google Translate detection failed: {e}")
1107
+ # Fallback to rule-based with Singlish detection
1108
+ return enhanced_rule_based_detection(text)
1109
+
1110
+ except Exception as e:
1111
+ logger.error(f"Language detection failed: {e}")
1112
+ return rule_based_language_detection(text)
1113
+
1114
+ def enhanced_rule_based_detection(text: str) -> Dict[str, Any]:
1115
+ """Enhanced rule-based detection with better Singlish recognition"""
1116
+ has_sinhala_unicode = detect_sinhala_content(text)
1117
+ is_romanized_sinhala = detect_singlish(text) and not has_sinhala_unicode
1118
+
1119
+ # More sophisticated Singlish detection
1120
+ if not has_sinhala_unicode and not is_romanized_sinhala:
1121
+ # Check for common Sinhala sentence patterns in English letters
1122
+ sinhala_patterns = [
1123
+ r'\b(mokadda|kohomada|api|oya|mama)\b',
1124
+ r'\b(eka|meka|thiyenne|kiyala)\b',
1125
+ r'\b(gana|genna|danna|karanna)\b',
1126
+ r'\b(budget|proposal).*\b(gana|eka)\b'
1127
+ ]
1128
+
1129
+ text_lower = text.lower()
1130
+ pattern_matches = sum(1 for pattern in sinhala_patterns if re.search(pattern, text_lower))
1131
+
1132
+ if pattern_matches >= 1: # Lower threshold for better detection
1133
+ is_romanized_sinhala = True
1134
+
1135
+ if has_sinhala_unicode:
1136
+ language_code = 'si'
1137
+ confidence = 0.9
1138
+ elif is_romanized_sinhala:
1139
+ language_code = 'singlish'
1140
+ confidence = 0.8
1141
+ else:
1142
+ language_code = 'en'
1143
+ confidence = 0.7
1144
+
1145
+ return {
1146
+ 'language': language_code,
1147
+ 'confidence': confidence,
1148
+ 'is_sinhala_unicode': has_sinhala_unicode,
1149
+ 'is_romanized_sinhala': is_romanized_sinhala,
1150
+ 'is_english': language_code == 'en',
1151
+ 'detection_method': 'enhanced_rule_based'
1152
+ }
1153
+
1154
+ def rule_based_language_detection(text: str) -> Dict[str, Any]:
1155
+ """Fallback rule-based language detection"""
1156
+ has_sinhala_unicode = detect_sinhala_content(text)
1157
+ is_romanized_sinhala = detect_singlish(text) and not has_sinhala_unicode
1158
+ is_english = not has_sinhala_unicode and not is_romanized_sinhala
1159
+
1160
+ if has_sinhala_unicode:
1161
+ language_code = 'si'
1162
+ elif is_romanized_sinhala:
1163
+ language_code = 'singlish'
1164
+ else:
1165
+ language_code = 'en'
1166
+
1167
+ return {
1168
+ 'language': language_code,
1169
+ 'confidence': 0.8, # Default confidence for rule-based
1170
+ 'is_sinhala_unicode': has_sinhala_unicode,
1171
+ 'is_romanized_sinhala': is_romanized_sinhala,
1172
+ 'is_english': is_english,
1173
+ 'detection_method': 'rule_based'
1174
+ }
1175
+
1176
+ def detect_singlish(text: str) -> bool:
1177
+ """Detect common Singlish patterns and words"""
1178
+ singlish_words = [
1179
+ 'mokadda', 'kohomada', 'api', 'oya', 'mama', 'eka', 'meka', 'oya', 'dan', 'kiyala',
1180
+ 'karan', 'karanna', 'gana', 'genna', 'danna', 'ahala', 'denna', # 'budget' and 'proposal' omitted: plain English queries contain both
1181
+ 'mata', 'ape', 'wage', 'wenas', 'thiyenne', 'kiyanawa', 'balanawa', 'pennanna',
1182
+ 'sampura', 'mudal', 'pasal', 'vyaparayak', 'rajaye', 'arthikaya', 'sammandala',
1183
+ 'kara', 'karanna', 'giya', 'yanawa', 'enawa', 'gihin', 'awe', 'nane', 'inne',
1184
+ 'danna', 'kiyanna', 'balanna', 'ganna', 'denna', 'yanna', 'enna'
1185
+ ]
1186
+
1187
+ # Convert to lowercase and check for common Singlish words
1188
+ text_lower = text.lower()
1189
+ singlish_word_count = sum(1 for word in singlish_words if word in text_lower)
1190
+
1191
+ # Consider it Singlish if it has 2 or more Singlish words
1192
+ return singlish_word_count >= 2
1193
+
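`detect_singlish` counts a vocabulary word whenever it occurs as a substring, so English text such as "Eureka capital" scores two hits (`eka`, `api`). The sketch below shows an exact-token alternative under stated assumptions: `SINGLISH_SUBSET` is a small illustrative subset of the word list above, and both function names are hypothetical.

```python
import re
from typing import Set

# Illustrative subset of the commit's word list (assumption: not the full vocabulary).
SINGLISH_SUBSET: Set[str] = {"mokadda", "gana", "eka", "api", "thiyenne", "wenas", "kiyanna"}


def count_singlish_tokens(text: str, vocabulary: Set[str] = SINGLISH_SUBSET) -> int:
    """Count distinct vocabulary words that appear as whole tokens in `text`."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & vocabulary)


def looks_like_singlish(text: str, threshold: int = 2) -> bool:
    """Apply the same two-word threshold as the original heuristic."""
    return count_singlish_tokens(text) >= threshold


if __name__ == "__main__":
    print(looks_like_singlish("EPF eka gana mokadda thiyenne?"))
    print(looks_like_singlish("Eureka capital"))
```

The trade-off is recall: exact tokens miss inflected forms such as "kiyannako", which the substring check does catch, so a stemmed or prefix-based match may be a better middle ground.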
1194
+ def ai_transliterate_singlish_to_sinhala(text: str) -> str:
1195
+ """AI-based transliteration from Romanized Sinhala to Sinhala script"""
1196
+ try:
1197
+ if sinhala_transliterator is None:
1198
+ # Fallback to rule-based transliteration
1199
+ logger.info("AI transliterator not available, using rule-based fallback")
1200
+ return rule_based_transliterate_singlish_to_sinhala(text)
1201
+
1202
+ # Use AI model for transliteration
1203
+ result = sinhala_transliterator(text, max_length=256, num_return_sequences=1)
1204
+ transliterated_text = result[0]['generated_text']
1205
+
1206
+ logger.info(f"AI transliteration: '{text}' -> '{transliterated_text}'")
1207
+ return transliterated_text
1208
+
1209
+ except Exception as e:
1210
+ logger.error(f"AI transliteration failed: {e}")
1211
+ return rule_based_transliterate_singlish_to_sinhala(text)
1212
+
1213
+ def rule_based_transliterate_singlish_to_sinhala(text: str) -> str:
1214
+ """Fallback rule-based transliteration for Romanized Sinhala"""
1215
+ try:
1216
+ # Common Singlish to Sinhala mappings (simplified)
1217
+ singlish_to_sinhala_map = {
1218
+ 'mokadda': 'මොකද්ද',
1219
+ 'kohomada': 'කොහොමද',
1220
+ 'api': 'අපි',
1221
+ 'oya': 'ඔයා',
1222
+ 'mama': 'මම',
1223
+ 'eka': 'එක',
1224
+ 'meka': 'මේක',
1225
+ 'dan': 'දැන්',
1226
+ 'kiyala': 'කියලා',
1227
+ 'gana': 'ගැන',
1228
+ 'genna': 'ගන්න',
1229
+ 'danna': 'දන්න',
1230
+ 'denna': 'දෙන්න',
1231
+ 'mata': 'මට',
1232
+ 'ape': 'අපේ',
1233
+ 'thiyenne': 'තියෙන්නේ',
1234
+ 'kiyanawa': 'කියනවා',
1235
+ 'balanawa': 'බලනවා',
1236
+ 'pennanna': 'පෙන්නන්න',
1237
+ 'sampura': 'සම්පූර්ණ',
1238
+ 'mudal': 'මුදල්',
1239
+ 'pasal': 'පාසල්',
1240
+ 'rajaye': 'රජයේ',
1241
+ 'arthikaya': 'ආර්ථිකය',
1242
+ 'kara': 'කර',
1243
+ 'karanna': 'කරන්න',
1244
+ 'giya': 'ගිය',
1245
+ 'yanawa': 'යනවා',
1246
+ 'enawa': 'එනවා',
1247
+ 'inne': 'ඉන්නේ',
1248
+ 'yanna': 'යන්න',
1249
+ 'enna': 'එන්න'
1250
+ }
1251
+
1252
+ # Simple word-by-word replacement
1253
+ words = text.lower().split()
1254
+ transliterated_words = []
1255
+
1256
+ for word in words:
1257
+ # Remove punctuation for mapping
1258
+ clean_word = re.sub(r'[^\w]', '', word)
1259
+ if clean_word in singlish_to_sinhala_map:
1260
+ transliterated_words.append(singlish_to_sinhala_map[clean_word])
1261
+ else:
1262
+ transliterated_words.append(word) # Keep original if no mapping
1263
+
1264
+ logger.info(f"Rule-based transliteration: '{text}' -> '{' '.join(transliterated_words)}'")
1265
+ return ' '.join(transliterated_words)
1266
+
1267
+ except Exception as e:
1268
+ logger.error(f"Rule-based transliteration error: {e}")
1269
+ return text # Return original text if transliteration fails
1270
+
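The rule-based fallback lowercases the whole input and drops punctuation from every word it maps. A minimal sketch of a variant that substitutes words in place is shown below, so acronyms and punctuation survive. `WORD_MAP` is a three-entry excerpt of the mapping above and `transliterate_words` is a hypothetical helper, not the commit's function.

```python
import re
from typing import Dict

# Excerpt of the commit's mapping table, kept short for illustration.
WORD_MAP: Dict[str, str] = {
    "mokadda": "මොකද්ද",
    "gana": "ගැන",
    "eka": "එක",
}


def transliterate_words(text: str, mapping: Dict[str, str] = WORD_MAP) -> str:
    """Replace each mapped Latin-script word, leaving everything else untouched."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        # Look up case-insensitively, but return the original word when unmapped.
        return mapping.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


if __name__ == "__main__":
    print(transliterate_words("EPF eka gana mokadda?"))
```

Because `re.sub` only rewrites the matched letters, the question mark and the capitalised "EPF" pass through unchanged.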
1271
+ def translate_text(text: str, target_language: str = 'en') -> str:
1272
+ """Translate text using Google Translate"""
1273
+ try:
1274
+ result = translator.translate(text, dest=target_language)
1275
+ return result.text
1276
+ except Exception as e:
1277
+ logger.error(f"Translation error: {e}")
1278
+ return text # Return original text if translation fails
1279
+
1280
+ def process_multilingual_input(user_message: str) -> tuple:
1281
+ """
1282
+ AI-enhanced multilingual input processing:
1283
+ AI Language Detection -> AI Transliteration -> Google Translate -> English
1284
+ """
1285
+ processed_message = user_message
1286
+ transliteration_used = False
1287
+ ai_detection_used = False
1288
+
1289
+ # Step 1: AI-based language detection
1290
+ language_info = ai_detect_language(user_message)
1291
+ original_language = language_info['language']
1292
+ confidence = language_info['confidence']
1293
+ detection_method = language_info['detection_method']
1294
+
1295
+ logger.info(f"Language detection: {original_language} (confidence: {confidence:.2f}, method: {detection_method})")
1296
+
1297
+ # Determine processing based on detected language
1298
+ if language_info['is_sinhala_unicode']:
1299
+ # Direct Sinhala Unicode -> English translation
1300
+ logger.info("Processing Sinhala Unicode input")
1301
+ original_language = 'si'
1302
+ needs_translation = True
1303
+ processed_message = translate_text(user_message, 'en')
1304
+ logger.info(f"Translated from Sinhala: '{user_message}' -> '{processed_message}'")
1305
+
1306
+ elif language_info['is_romanized_sinhala']:
1307
+ # Romanized Sinhala -> AI Transliteration -> Translation
1308
+ logger.info("Processing Romanized Sinhala (Singlish) input")
1309
+ original_language = 'singlish'
1310
+ needs_translation = True
1311
+ transliteration_used = True
1312
+ ai_detection_used = detection_method == 'google_translate' # No detector reports 'ai', so the old comparison was always False
1313
+
1314
+ try:
1315
+ # Step 1: AI-based transliteration
1316
+ sinhala_text = ai_transliterate_singlish_to_sinhala(user_message)
1317
+ logger.info(f"AI transliterated: '{user_message}' -> '{sinhala_text}'")
1318
+
1319
+ # Step 2: Translate Sinhala to English for search
1320
+ processed_message = translate_text(sinhala_text, 'en')
1321
+ logger.info(f"Translated to English: '{sinhala_text}' -> '{processed_message}'")
1322
+
1323
+ except Exception as e:
1324
+ logger.error(f"Error in AI processing pipeline: {e}")
1325
+ # Fallback: try direct translation or keep original
1326
+ try:
1327
+ processed_message = translate_text(user_message, 'en')
1328
+ logger.info(f"Fallback translation: '{user_message}' -> '{processed_message}'")
1329
+ except Exception:
1330
+ processed_message = user_message
1331
+ needs_translation = False
1332
+ transliteration_used = False
1333
+ logger.info("Using original text for search")
1334
+
1335
+ else:
1336
+ # English or other languages
1337
+ logger.info("Processing as English input")
1338
+ original_language = 'en'
1339
+ needs_translation = False
1340
+ processed_message = user_message
1341
+
1342
+ return processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence
1343
+
1344
+ def translate_response_if_needed(response: str, original_language: str) -> str:
1345
+ """Translate response back to original language if needed"""
1346
+ if original_language == 'si':
1347
+ # Translate back to Sinhala
1348
+ try:
1349
+ translated_response = translate_text(response, 'si')
1350
+ logger.info(f"Translated response to Sinhala: '{response[:100]}...' -> '{translated_response[:100]}...'")
1351
+ return translated_response
1352
+ except Exception as e:
1353
+ logger.error(f"Error translating response to Sinhala: {e}")
1354
+ return response
1355
+ elif original_language == 'singlish':
1356
+ # For Singlish, we can optionally provide a mixed response
1357
+ # For now, keep English response but could enhance later
1358
+ return response
1359
+
1360
+ return response
1361
+
1362
+ def get_pinecone_index():
1363
+ """Get the budget proposals Pinecone index"""
1364
+ try:
1365
+ return pc.Index(BUDGET_INDEX_NAME)
1366
+ except Exception as e:
1367
+ logger.error(f"Error accessing Pinecone index: {e}")
1368
+ return None
1369
+
1370
+ def search_budget_proposals(query: str) -> str:
1371
+ """Search budget proposals using the semantic search API"""
1372
+ try:
1373
+ import requests
1374
+
1375
+ # Use the deployed semantic search API
1376
+ response = requests.post(
1377
+ "https://danulr05-budget-proposals-search-api.hf.space/api/search",
1378
+ json={"query": query, "top_k": 5},
1379
+ timeout=10
1380
+ )
1381
+
1382
+ if response.status_code == 200:
1383
+ data = response.json()
1384
+ results = data.get("results", [])
1385
+
1386
+ if not results:
1387
+ return "No relevant budget proposals found in the database."
1388
+
1389
+ # Build context from search results
1390
+ context_parts = []
1391
+ for result in results[:3]: # Limit to top 3 results
1392
+ file_path = result.get("file_path", "")
1393
+ category = result.get("category", "")
1394
+ summary = result.get("summary", "")
1395
+ cost = result.get("costLKR", "")
1396
+ title = result.get("title", "")
1397
+ content = result.get("content", "") # Get the actual content
1398
+
1399
+ context_parts.append(f"From {file_path} ({category}): {title}")
1400
+ if content:
1401
+ context_parts.append(f"Content: {content}")
1402
+ elif summary:
1403
+ context_parts.append(f"Summary: {summary}")
1404
+ if cost and cost != "No Costing Available":
1405
+ context_parts.append(f"Cost: {cost}")
1406
+
1407
+ return "\n\n".join(context_parts)
1408
+ else:
1409
+ return f"Error accessing semantic search API: {response.status_code}"
1410
+
1411
+ except Exception as e:
1412
+ logger.error(f"Error searching budget proposals: {e}")
1413
+ return f"Error searching database: {str(e)}"
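Review note: the context-building loop in `search_budget_proposals` can be exercised without the network if it is pulled out into a pure function. The following is a minimal sketch, assuming the result fields used above (`file_path`, `category`, `title`, `content`, `summary`, `costLKR`). The helper name `build_context` and the sample record are hypothetical and not part of this commit.

```python
from typing import Any, Dict, List


def build_context(results: List[Dict[str, Any]], limit: int = 3) -> str:
    """Format the top `limit` search results into a prompt context string."""
    parts: List[str] = []
    for result in results[:limit]:
        parts.append(
            f"From {result.get('file_path', '')} "
            f"({result.get('category', '')}): {result.get('title', '')}"
        )
        content = result.get("content", "")
        summary = result.get("summary", "")
        cost = result.get("costLKR", "")
        # Prefer the full content; fall back to the summary when it is missing.
        if content:
            parts.append(f"Content: {content}")
        elif summary:
            parts.append(f"Summary: {summary}")
        # Skip the placeholder value the API uses for uncosted proposals.
        if cost and cost != "No Costing Available":
            parts.append(f"Cost: {cost}")
    return "\n\n".join(parts)


# Hypothetical sample record, for illustration only.
sample = [
    {
        "file_path": "MLB.pdf",
        "category": "Labour",
        "title": "Maternity leave",
        "summary": "Example summary",
        "costLKR": "No Costing Available",
    }
]
context = build_context(sample)
print(context)
```

Keeping the formatting separate from the HTTP call would also let the error strings returned on failure be distinguished from real context before they reach the prompt.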
1414
+
1415
+ # Create the RAG tool
1416
+ search_tool = Tool(
1417
+ name="search_budget_proposals",
1418
+ description="Search for relevant budget proposals in the vector database. Use this when you need specific information about budget proposals, costs, policies, or implementation details.",
1419
+ func=search_budget_proposals
1420
+ )
1421
+
1422
+ # Create the prompt template for the agent
1423
+ agent_prompt = ChatPromptTemplate.from_messages([
1424
+ ("system", """You are a helpful assistant for budget proposals in Sri Lanka. You have access to a vector database containing detailed information about various budget proposals. You can communicate in English, Sinhala, and understand Singlish (Sinhala written in English letters).
1425
+
1426
+ When a user asks about budget proposals, you should:
1427
+ 1. Use the search_budget_proposals tool to find relevant information
1428
+ 2. Provide accurate, detailed responses based on the retrieved information
1429
+ 3. Always cite the source documents when mentioning specific proposals
1430
+ 4. Be professional but approachable in any language
1431
+ 5. If the search doesn't return relevant results, acknowledge this and provide general guidance
1432
+ 6. Respond in the same language or style as the user's question when possible
1433
+
1434
+ Guidelines:
1435
+ - Always use the search tool for specific questions about budget proposals
1436
+ - Include source citations for any mention of proposals, costs, policies, revenue, or implementation
1437
+ - Keep responses clear and informative in any language
1438
+ - Use a balanced tone - helpful but not overly casual
1439
+ - If asked about topics not covered, redirect to relevant topics professionally
1440
+ - Be culturally sensitive when discussing Sri Lankan policies and economic matters
1441
+ - When responding in Sinhala, use appropriate formal language for policy discussions"""),
1442
+ MessagesPlaceholder(variable_name="chat_history"),
1443
+ ("human", "{input}"),
1444
+ MessagesPlaceholder(variable_name="agent_scratchpad")
1445
+ ])
1446
+
1447
+ # Store conversation memories for different sessions
1448
+ conversation_memories: Dict[str, ConversationBufferWindowMemory] = {}
1449
+
1450
+ def get_or_create_memory(session_id: str) -> ConversationBufferWindowMemory:
1451
+ """Get or create a memory instance for a session"""
1452
+ if session_id not in conversation_memories:
1453
+ # Create new memory with a window of k=10 exchanges (up to 20 messages)
1454
+ conversation_memories[session_id] = ConversationBufferWindowMemory(
1455
+ k=10, # Keep the last 10 exchanges (human + AI message pairs)
1456
+ return_messages=True,
1457
+ memory_key="chat_history"
1458
+ )
1459
+ logger.info(f"Created new memory for session: {session_id}")
1460
+
1461
+ return conversation_memories[session_id]
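Review note: as far as I understand LangChain's buffer window memory, `k` counts exchanges (one human plus one AI message each) rather than individual messages, so `k=10` would expose up to 20 messages. Note also that `generate_response_with_rag` reads `memory.chat_memory.messages` directly and slices the last six itself, so the window only matters on the agent path. The sketch below illustrates the windowing idea with plain lists. `window` is a hypothetical stand-in, not the library's API.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


def window(messages: List[Message], k: int) -> List[Message]:
    """Return the last k exchanges, i.e. the last 2 * k messages."""
    return messages[-2 * k:] if k > 0 else []


# Build 12 exchanges (24 messages) of dummy history.
history: List[Message] = []
for i in range(12):
    history.append(("human", f"question {i}"))
    history.append(("ai", f"answer {i}"))

recent = window(history, k=10)
print(len(recent), recent[0])
```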
1462
+
1463
+ def create_agent(session_id: str) -> AgentExecutor:
1464
+ """Create a LangChain agent with memory and RAG capabilities"""
1465
+ memory = get_or_create_memory(session_id)
1466
+
1467
+ # Create the agent
1468
+ agent = create_openai_functions_agent(
1469
+ llm=llm,
1470
+ tools=[search_tool],
1471
+ prompt=agent_prompt
1472
+ )
1473
+
1474
+ # Create agent executor with memory
1475
+ agent_executor = AgentExecutor(
1476
+ agent=agent,
1477
+ tools=[search_tool],
1478
+ memory=memory,
1479
+ verbose=False,
1480
+ handle_parsing_errors=True
1481
+ )
1482
+
1483
+ return agent_executor
1484
+
1485
+ def get_available_pdfs() -> List[str]:
1486
+ """Dynamically get list of available PDF files from assets directory"""
1487
+ try:
1488
+ import os
1489
+ pdf_dir = "assets/pdfs"
1490
+ if os.path.exists(pdf_dir):
1491
+ pdf_files = [f for f in os.listdir(pdf_dir) if f.lower().endswith('.pdf')]
1492
+ return pdf_files
1493
+ else:
1494
+ # Fallback to known PDFs if directory doesn't exist
1495
+ return ['MLB.pdf', 'Cigs.pdf', 'Elec.pdf', 'Audit_EPF.pdf', 'EPF.pdf', 'Discretion.pdf', '1750164001872.pdf']
1496
+ except Exception as e:
1497
+ logger.error(f"Error getting available PDFs: {e}")
1498
+ # Fallback to known PDFs
1499
+ return ['MLB.pdf', 'Cigs.pdf', 'Elec.pdf', 'Audit_EPF.pdf', 'EPF.pdf', 'Discretion.pdf', '1750164001872.pdf']
1500
+
1501
+ def extract_sources_from_response(response: str) -> List[str]:
1502
+ """Extract source documents mentioned in the response"""
1503
+ sources = []
1504
+
1505
+ # Get dynamically available PDF files
1506
+ available_pdfs = get_available_pdfs()
1507
+
1508
+ # Substring match: collect every known PDF filename that appears anywhere in the response, e.g. "(Source: MLB.pdf)" or "(Sources: MLB.pdf, EPF.pdf)"
1509
+ for pdf in available_pdfs:
1510
+ if pdf in response:
1511
+ sources.append(pdf)
1512
+
1513
+ return list(set(sources)) # Remove duplicates
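Review note: because the match is a plain substring test, one filename that is a suffix of another (for example `EPF.pdf` inside `Audit_EPF.pdf`, both in the fallback list above) is reported whenever the longer name appears. `list(set(...))` also discards the original order. A minimal sketch of an order-preserving variant that makes the overlap visible follows. `extract_sources` and the sample reply are hypothetical.

```python
from typing import Iterable, List


def extract_sources(response: str, known_pdfs: Iterable[str]) -> List[str]:
    """Return known PDF names mentioned in `response`, in list order, no duplicates."""
    found: List[str] = []
    for pdf in known_pdfs:
        if pdf in response and pdf not in found:
            found.append(pdf)
    return found


reply = "Costs are discussed in EPF.pdf (Sources: MLB.pdf, EPF.pdf)."
print(extract_sources(reply, ["MLB.pdf", "Cigs.pdf", "EPF.pdf"]))

# Substring overlap: mentioning only Audit_EPF.pdf also reports EPF.pdf.
print(extract_sources("See Audit_EPF.pdf.", ["Audit_EPF.pdf", "EPF.pdf"]))
```

If the overlap matters, matching on a word boundary or on the exact citation format would be one way to tighten this.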
1514
+
1515
+ def generate_response_with_rag(user_message: str, session_id: str) -> Dict[str, Any]:
1516
+ """Generate response using RAG with memory and multilingual support"""
1517
+ try:
1518
+ # Process multilingual input
1519
+ processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence = process_multilingual_input(user_message)
1520
+ logger.info(f"Input processing: original='{user_message}', processed='{processed_message}', lang='{original_language}', transliteration='{transliteration_used}', ai_detection='{ai_detection_used}', confidence='{confidence:.2f}'")
1521
+
1522
+ # Get or create memory for this session
1523
+ memory = get_or_create_memory(session_id)
1524
+
1525
+ # Search for relevant context using processed (English) message
1526
+ search_context = search_budget_proposals(processed_message)
1527
+
1528
+ # Get conversation history for context
1529
+ chat_history = memory.chat_memory.messages
1530
+ conversation_context = ""
1531
+ if chat_history:
1532
+ # Get last few messages for context
1533
+ recent_messages = chat_history[-6:] # Last 3 exchanges
1534
+ conversation_parts = []
1535
+ for msg in recent_messages:
1536
+ if isinstance(msg, HumanMessage):
1537
+ conversation_parts.append(f"User: {msg.content}")
1538
+ elif isinstance(msg, AIMessage):
1539
+ conversation_parts.append(f"Assistant: {msg.content}")
1540
+ conversation_context = "\n".join(conversation_parts)
1541
+
1542
+ # Create a prompt with conversation history and retrieved context
1543
+ language_instruction = ""
1544
+ if original_language == 'si':
1545
+ language_instruction = "\n\nIMPORTANT: The user asked in Sinhala. Please respond in Sinhala using proper Sinhala script and formal language appropriate for policy discussions."
1546
+ elif original_language == 'singlish':
1547
+ if transliteration_used:
1548
+ language_instruction = "\n\nNote: The user used Romanized Sinhala (transliterated via Swabhasha). Please respond in Sinhala using proper Sinhala script and formal language appropriate for policy discussions."
1549
+ else:
1550
+ language_instruction = "\n\nNote: The user used Singlish (Sinhala words in English letters). You may respond in English but consider using some familiar Sri Lankan terminology where appropriate."
1551
+
1552
+ prompt = f"""You are a helpful assistant for budget proposals in Sri Lanka. You can communicate in English, Sinhala, and understand Singlish.
1553
+
1554
+ Based on the following information from the budget proposals database:
1555
+
1556
+ {search_context}
1557
+
1558
+ {conversation_context}
1559
+
1560
+ Current user question: {processed_message}
1561
+ Original user input: {user_message}
1562
+ {language_instruction}
1563
+
1564
+ Guidelines:
1565
+ - Be professional but approachable in any language
1566
+ - Include specific details from the retrieved information
1567
+ - Cite the source documents when mentioning specific proposals
1568
+ - If the search doesn't return relevant results, acknowledge this and provide general guidance
1569
+ - Keep responses clear and informative
1570
+ - Reference previous conversation context when relevant
1571
+ - Maintain conversation continuity
1572
+ - Be culturally sensitive when discussing Sri Lankan policies
1573
+ - When responding in Sinhala, use appropriate formal language for policy discussions
1574
+
1575
+ Please provide a helpful response:"""
1576
+
1577
+ # Generate response using the LLM directly
1578
+ response = llm.invoke(prompt)
1579
+ response_text = response.content.strip()
1580
+
1581
+ # Translate response back if needed
1582
+ if needs_translation and (original_language == 'si' or (original_language == 'singlish' and transliteration_used)):
1583
+ response_text = translate_response_if_needed(response_text, original_language)
1584
+
1585
+ # Extract sources from response
1586
+ sources = extract_sources_from_response(response_text)
1587
+
1588
+ # Add messages to memory (store original user message for context)
1589
+ memory.chat_memory.add_user_message(user_message)
1590
+ memory.chat_memory.add_ai_message(response_text)
1591
+
1592
+ # Get updated conversation history for context
1593
+ chat_history = memory.chat_memory.messages
1594
+
1595
+ return {
1596
+ "response": response_text,
1597
+ "confidence": "high",
1598
+ "session_id": session_id,
1599
+ "conversation_length": len(chat_history),
1600
+ "memory_used": True,
1601
+ "rag_used": True,
1602
+ "sources": sources,
1603
+ "language_detected": original_language,
1604
+ "translation_used": needs_translation,
1605
+ "transliteration_used": transliteration_used,
1606
+ "ai_detection_used": ai_detection_used,
1607
+ "detection_confidence": confidence
1608
+ }
1609
+
1610
+ except Exception as e:
1611
+ logger.error(f"Error generating response with RAG: {e}")
1612
+ # Provide error message in appropriate language
1613
+ error_message = "I'm sorry, I'm having trouble processing your request right now. Please try again later."
1614
+ if locals().get('original_language') == 'si':
1615
+ try:
1616
+ error_message = translate_text(error_message, 'si')
1617
+ except Exception:
1618
+ pass # Keep English if translation fails
1619
+
1620
+ return {
1621
+ "response": error_message,
1622
+ "confidence": "error",
1623
+ "session_id": session_id,
1624
+ "memory_used": False,
1625
+ "rag_used": False,
1626
+ "sources": [],
1627
+ "language_detected": original_language if 'original_language' in locals() else 'en',
1628
+ "translation_used": False,
1629
+ "transliteration_used": False,
1630
+ "ai_detection_used": False,
1631
+ "detection_confidence": 0.0
1632
+ }
1633
+
1634
+ def clear_session_memory(session_id: str) -> bool:
1635
+ """Clear memory for a specific session"""
1636
+ try:
1637
+ if session_id in conversation_memories:
1638
+ del conversation_memories[session_id]
1639
+ logger.info(f"Cleared memory for session: {session_id}")
1640
+ return True
1641
+ return False
1642
+ except Exception as e:
1643
+ logger.error(f"Error clearing memory: {e}")
1644
+ return False
1645
+
1646
+ @app.route('/api/chat', methods=['POST'])
1647
+ def chat():
1648
+ """Enhanced chat endpoint with memory"""
1649
+ try:
1650
+ data = request.get_json(silent=True) or {}
1651
+ user_message = data.get('message', '').strip()
1652
+ session_id = data.get('session_id', 'default')
1653
+
1654
+ if not user_message:
1655
+ return jsonify({
1656
+ "error": "Message is required"
1657
+ }), 400
1658
+
1659
+ # Generate response with memory
1660
+ result = generate_response_with_rag(user_message, session_id)
1661
+
1662
+ return jsonify({
1663
+ "response": result["response"],
1664
+ "confidence": result["confidence"],
1665
+ "session_id": session_id,
1666
+ "conversation_length": result.get("conversation_length", 0),
1667
+ "memory_used": result.get("memory_used", False),
1668
+ "rag_used": result.get("rag_used", False),
1669
+ "sources": result.get("sources", []),
1670
+ "user_message": user_message,
1671
+ "language_detected": result.get("language_detected", "en"),
1672
+ "translation_used": result.get("translation_used", False),
1673
+ "transliteration_used": result.get("transliteration_used", False),
1674
+ "ai_detection_used": result.get("ai_detection_used", False),
1675
+ "detection_confidence": result.get("detection_confidence", 0.0)
1676
+ })
1677
+
1678
+ except Exception as e:
1679
+ logger.error(f"Chat API error: {e}")
1680
+ return jsonify({"error": str(e)}), 500
1681
+
1682
+ @app.route('/api/chat/clear', methods=['POST'])
1683
+ def clear_chat():
1684
+ """Clear chat memory for a session"""
1685
+ try:
1686
+ data = request.get_json(silent=True) or {}
1687
+ session_id = data.get('session_id', 'default')
1688
+
1689
+ success = clear_session_memory(session_id)
1690
+
1691
+ return jsonify({
1692
+ "success": success,
1693
+ "session_id": session_id,
1694
+ "message": "Chat memory cleared successfully" if success else "Session not found"
1695
+ })
1696
+
1697
+ except Exception as e:
1698
+ logger.error(f"Clear chat error: {e}")
1699
+ return jsonify({"error": str(e)}), 500
1700
+
1701
+ @app.route('/api/chat/sessions', methods=['GET'])
1702
+ def list_sessions():
1703
+ """List all active chat sessions"""
1704
+ try:
1705
+ sessions = []
1706
+ for session_id, memory in conversation_memories.items():
1707
+ messages = memory.chat_memory.messages
1708
+ sessions.append({
1709
+ "session_id": session_id,
1710
+ "message_count": len(messages),
1711
+ "last_activity": datetime.now().isoformat() # Simplified for now
1712
+ })
1713
+
1714
+ return jsonify({
1715
+ "sessions": sessions,
1716
+ "total_sessions": len(sessions)
1717
+ })
1718
+
1719
+ except Exception as e:
1720
+ logger.error(f"List sessions error: {e}")
1721
+ return jsonify({"error": str(e)}), 500
1722
+
1723
+ @app.route('/api/chat/history/<session_id>', methods=['GET'])
1724
+ def get_chat_history(session_id: str):
1725
+ """Get chat history for a specific session"""
1726
+ try:
1727
+ if session_id not in conversation_memories:
1728
+ return jsonify({
1729
+ "session_id": session_id,
1730
+ "history": [],
1731
+ "message_count": 0
1732
+ })
1733
+
1734
+ memory = conversation_memories[session_id]
1735
+ messages = memory.chat_memory.messages
1736
+
1737
+ history = []
1738
+ for msg in messages:
1739
+ if isinstance(msg, HumanMessage):
1740
+ history.append({
1741
+ "type": "human",
1742
+ "content": msg.content,
1743
+ "timestamp": datetime.now().isoformat()
1744
+ })
1745
+ elif isinstance(msg, AIMessage):
1746
+ history.append({
1747
+ "type": "ai",
1748
+ "content": msg.content,
1749
+ "timestamp": datetime.now().isoformat()
1750
+ })
1751
+
1752
+ return jsonify({
1753
+ "session_id": session_id,
1754
+ "history": history,
1755
+ "message_count": len(history)
1756
+ })
1757
+
1758
+ except Exception as e:
1759
+ logger.error(f"Get chat history error: {e}")
1760
+ return jsonify({"error": str(e)}), 500
1761
+
1762
+ @app.route('/api/chat/health', methods=['GET'])
1763
+ def chat_health():
1764
+ """Health check for the enhanced chatbot"""
1765
+ try:
1766
+ # Test the LangChain agent end to end (note: this creates a "health_check" session in memory)
1767
+ test_agent = create_agent("health_check")
1768
+ test_response = test_agent.invoke({"input": "Hello"})
1769
+
1770
+ # Test vector database connection
1771
+ pc_index = get_pinecone_index()
1772
+ vector_db_status = "connected" if pc_index else "disconnected"
1773
+
1774
+ return jsonify({
1775
+ "status": "healthy",
1776
+ "message": "Enhanced budget proposals chatbot with RAG is running",
1777
+ "langchain_status": "connected" if test_response else "disconnected",
1778
+ "vector_db_status": vector_db_status,
1779
+ "rag_enabled": True,
1780
+ "active_sessions": len(conversation_memories),
1781
+ "memory_enabled": True
1782
+ })
1783
+ except Exception as e:
1784
+ return jsonify({
1785
+ "status": "unhealthy",
1786
+ "message": f"Error: {str(e)}"
1787
+ }), 500
1788
+
1789
+ @app.route('/api/chat/debug/<session_id>', methods=['GET'])
1790
+ def debug_session(session_id: str):
1791
+ """Debug endpoint to check session memory"""
1792
+ try:
1793
+ memory_exists = session_id in conversation_memories
1794
+ memory_info = {
1795
+ "session_id": session_id,
1796
+ "memory_exists": memory_exists,
1797
+ "total_sessions": len(conversation_memories),
1798
+ "session_keys": list(conversation_memories.keys())
1799
+ }
1800
+
1801
+ if memory_exists:
1802
+ memory = conversation_memories[session_id]
1803
+ messages = memory.chat_memory.messages
1804
+ memory_info.update({
1805
+ "message_count": len(messages),
1806
+ "messages": [
1807
+ {
1808
+ "type": getattr(msg, 'type', 'unknown'),
1809
+ "content": getattr(msg, 'content', '')[:100] + "..." if len(getattr(msg, 'content', '')) > 100 else getattr(msg, 'content', '')
1810
+ }
1811
+ for msg in messages
1812
+ ]
1813
+ })
1814
+
1815
+ return jsonify(memory_info)
1816
+
1817
+ except Exception as e:
1818
+ logger.error(f"Debug session error: {e}")
1819
+ return jsonify({"error": str(e)}), 500
1820
+
1821
+ @app.route('/api/chat/suggestions', methods=['GET'])
1822
+ def get_chat_suggestions():
1823
+ """Get suggested questions for the chatbot with multilingual support"""
1824
+ suggestions = [
1825
+ "What are the maternity leave benefits proposed? 🤱",
1826
+ "How do the cigarette tax proposals work? 💰",
1827
+ "What changes are proposed for electricity tariffs? ⚡",
1828
+ "Tell me about the EPF audit proposals 📊",
1829
+ "What tax reforms are being suggested? 🏛️",
1830
+ "How will these proposals affect the economy? 📈",
1831
+ "What is the cost of implementing these proposals? 💵",
1832
+ "Can you compare the costs of different proposals? ⚖️",
1833
+ "What are the main benefits of these proposals? ✨",
1834
+ "Budget proposals gana kiyanna 📋",
1835
+ "EPF eka gana mokadda thiyenne? 💰",
1836
+ "Electricity bill eka wenas wenawada? ⚡",
1837
+ "Maternity leave benefits kiyannako 🤱",
1838
+ "මේ budget proposals වල cost එක කීයද? 💵",
1839
+ "රජයේ ආර්ථික ප්‍රතිපත්ති ගැන කියන්න 🏛️"
1840
+ ]
1841
+
1842
+ return jsonify({
1843
+ "suggestions": suggestions,
1844
+ "supported_languages": ["English", "Sinhala", "Singlish"]
1845
+ })
1846
+
1847
+ @app.route('/api/chat/available-pdfs', methods=['GET'])
1848
+ def get_available_pdfs_endpoint():
1849
+ """Get list of available PDF files for debugging"""
1850
+ try:
1851
+ available_pdfs = get_available_pdfs()
1852
+ return jsonify({
1853
+ "available_pdfs": available_pdfs,
1854
+ "count": len(available_pdfs),
1855
+ "pdf_directory": "assets/pdfs"
1856
+ })
1857
+ except Exception as e:
1858
+ logger.error(f"Error getting available PDFs: {e}")
1859
+ return jsonify({"error": str(e)}), 500
1860
+
1861
+ @app.route('/api/chat/detect-language', methods=['POST'])
1862
+ def detect_language():
1863
+ """Test language detection functionality"""
1864
+ try:
1865
+ data = request.get_json(silent=True) or {}
1866
+ text = data.get('text', '').strip()
1867
+
1868
+ if not text:
1869
+ return jsonify({
1870
+ "error": "Text is required"
1871
+ }), 400
1872
+
1873
+ processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence = process_multilingual_input(text)
1874
+
1875
+ return jsonify({
1876
+ "original_text": text,
1877
+ "processed_text": processed_message,
1878
+ "language_detected": original_language,
1879
+ "translation_needed": needs_translation,
1880
+ "transliteration_used": transliteration_used,
1881
+ "ai_detection_used": ai_detection_used,
1882
+ "detection_confidence": confidence,
1883
+ "contains_sinhala": detect_sinhala_content(text),
1884
+ "is_singlish": detect_singlish(text)
1885
+ })
1886
+
1887
+ except Exception as e:
1888
+ logger.error(f"Language detection error: {e}")
1889
+ return jsonify({"error": str(e)}), 500
1890
+
1891
+ @app.route('/', methods=['GET'])
1892
+ def home():
1893
+ """Home endpoint with API documentation"""
1894
+ return jsonify({
1895
+ "message": "Multilingual Budget Proposals Chatbot API with Swabhasha Pipeline",
1896
+ "version": "2.1.0",
1897
+ "supported_languages": ["English", "Sinhala", "Romanized Sinhala (Singlish)"],
1898
+ "features": ["RAG", "Memory", "Swabhasha Transliteration", "Google Translation", "FAISS Vector Store"],
1899
+ "pipeline": "Romanized Sinhala → Swabhasha → Sinhala Script → Google Translate → English → LLM → Response",
1900
+ "endpoints": {
1901
+ "POST /api/chat": "Chat with memory, RAG, and multilingual support",
1902
+ "POST /api/chat/clear": "Clear chat memory",
1903
+ "GET /api/chat/sessions": "List active sessions",
1904
+ "GET /api/chat/history/<session_id>": "Get chat history",
1905
+ "GET /api/chat/health": "Health check",
1906
+ "GET /api/chat/suggestions": "Get suggested questions (multilingual)",
1907
+ "GET /api/chat/available-pdfs": "Get available PDF files",
1908
+ "POST /api/chat/detect-language": "Test language detection"
1909
+ },
1910
+ "status": "running"
1911
+ })
1912
+
1913
+ if __name__ == '__main__':
1914
+ app.run(debug=False, host='0.0.0.0', port=7860)
1915
+ #!/usr/bin/env python3
1916
+ """
1917
+ Enhanced Budget Proposals Chatbot API using LangChain with Memory and Agentic RAG
1918
+ """
1919
+
1920
+ from flask import Flask, request, jsonify
1921
+ from flask_cors import CORS
1922
+ import os
1923
+ import logging
1924
+ import json
1925
+ from datetime import datetime
1926
+ from typing import Dict, List, Any
1927
+
1928
+ # LangChain imports
1929
+ from langchain_google_genai import ChatGoogleGenerativeAI
1930
+ from langchain.memory import ConversationBufferWindowMemory
1931
+ from langchain.schema import HumanMessage, AIMessage
1932
+ from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
1933
+ from langchain.chains import LLMChain
1934
+ from langchain_community.chat_message_histories import RedisChatMessageHistory
1935
+ from langchain.tools import Tool
1936
+ from langchain.agents import AgentExecutor, create_openai_functions_agent
1937
+ from langchain.agents.openai_functions_agent.base import OpenAIFunctionsAgent
1938
+ from langchain.schema import BaseMessage
1939
+
1940
+ # Vector database imports
1941
+ from pinecone import Pinecone
1942
+ from sentence_transformers import SentenceTransformer
1943
+
1944
+ # Language detection and translation imports
1945
+ from googletrans import Translator
1946
+ import re
1947
+ import requests
1948
+ import json
1949
+
1950
+ # AI-based language processing imports
1951
+ from transformers import pipeline
1952
+ import torch
1953
+
1954
  app = Flask(__name__)
1955
  CORS(app)
1956
 
 
1986
  # Initialize translator
1987
  translator = Translator()
1988
 
1989
+ # Initialize AI-based language detection and transliteration models
1990
+ logger.info("Loading AI models...")
1991
+ try:
1992
+ # Use Google Translate's language detection which supports Sinhala
1993
+ # This is more reliable for Sinhala than the HF model
1994
+ language_detector = "google_translate" # Use Google Translate for detection
1995
+ logger.info("Using Google Translate for language detection (supports Sinhala)")
1996
+ except Exception as e:
1997
+ logger.error(f"Failed to initialize language detection: {e}")
1998
+ language_detector = None
1999
+
2000
+ try:
2001
+ # Sinhala transliteration model
2002
+ sinhala_transliterator = pipeline(
2003
+ "text2text-generation",
2004
+ model="deshanksuman/swabhashambart50SinhalaTransliteration"
2005
+ )
2006
+ logger.info("Sinhala transliteration model loaded successfully")
2007
+ except Exception as e:
2008
+ logger.error(f"Failed to load transliteration model: {e}")
2009
+ sinhala_transliterator = None
2010
+
2011
  def detect_sinhala_content(text: str) -> bool:
2012
  """Detect if text contains Sinhala characters"""
2013
  # Sinhala Unicode range: U+0D80 to U+0DFF
2014
  sinhala_pattern = re.compile(r'[\u0D80-\u0DFF]')
2015
  return bool(sinhala_pattern.search(text))
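Review note: the Unicode-range check is the most reliable branch of the detection pipeline, and it is easy to verify in isolation. A small self-contained sketch of the same technique follows. `contains_sinhala` is an illustrative name, and the sample strings are taken from the suggestions list in this file.

```python
import re

# Sinhala Unicode block: U+0D80 to U+0DFF.
SINHALA_PATTERN = re.compile(r"[\u0D80-\u0DFF]")


def contains_sinhala(text: str) -> bool:
    """Return True if any character of `text` falls in the Sinhala block."""
    return bool(SINHALA_PATTERN.search(text))


print(contains_sinhala("EPF eka gana mokadda thiyenne?"))  # romanized, no Sinhala script
print(contains_sinhala("මේ budget proposals වල cost එක කීයද?"))  # mixed script
```

Mixed-script input counts as Sinhala under this check, which matches how the function is used in `ai_detect_language`.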
2016
 
2017
+ def ai_detect_language(text: str) -> Dict[str, Any]:
2018
+ """Enhanced language detection using Google Translate (supports Sinhala)"""
2019
+ try:
2020
+ if language_detector is None:
2021
+ # Fallback to rule-based detection
2022
+ return rule_based_language_detection(text)
2023
+
2024
+ # Check for Sinhala Unicode first (most reliable)
2025
+ has_sinhala_unicode = detect_sinhala_content(text)
2026
+ if has_sinhala_unicode:
2027
+ return {
2028
+ 'language': 'si',
2029
+ 'confidence': 0.95,
2030
+ 'is_sinhala_unicode': True,
2031
+ 'is_romanized_sinhala': False,
2032
+ 'is_english': False,
2033
+ 'detection_method': 'unicode_detection'
2034
+ }
2035
+
2036
+ # Use Google Translate for language detection
2037
+ try:
2038
+ detection_result = translator.detect(text)
2039
+ detected_lang = detection_result.lang
2040
+ confidence = detection_result.confidence or 0.0 # Guard against a missing (None) confidence value
2041
+
2042
+ # Check if it's romanized Sinhala based on content analysis
2043
+ is_romanized_sinhala = (
2044
+ detected_lang in ['en', 'unknown'] and
2045
+ detect_singlish(text)
2046
+ )
2047
+
2048
+ # Override detection if Singlish patterns are strong
2049
+ if is_romanized_sinhala:
2050
+ detected_lang = 'singlish'
2051
+ confidence = max(0.7, confidence) # Boost confidence for Singlish
2052
+
2053
+ return {
2054
+ 'language': detected_lang,
2055
+ 'confidence': confidence,
2056
+ 'is_sinhala_unicode': False,
2057
+ 'is_romanized_sinhala': is_romanized_sinhala,
2058
+ 'is_english': detected_lang == 'en' and not is_romanized_sinhala,
2059
+ 'detection_method': 'google_translate'
2060
+ }
2061
+
2062
+ except Exception as e:
2063
+ logger.error(f"Google Translate detection failed: {e}")
2064
+ # Fallback to rule-based with Singlish detection
2065
+ return enhanced_rule_based_detection(text)
2066
+
2067
+ except Exception as e:
2068
+ logger.error(f"Language detection failed: {e}")
2069
+ return rule_based_language_detection(text)
2070
+
2071
+ def enhanced_rule_based_detection(text: str) -> Dict[str, Any]:
2072
+ """Enhanced rule-based detection with better Singlish recognition"""
2073
+ has_sinhala_unicode = detect_sinhala_content(text)
2074
+ is_romanized_sinhala = detect_singlish(text) and not has_sinhala_unicode
2075
+
2076
+ # More sophisticated Singlish detection
2077
+ if not has_sinhala_unicode and not is_romanized_sinhala:
2078
+ # Check for common Sinhala sentence patterns in English letters
2079
+ sinhala_patterns = [
2080
+ r'\b(mokadda|kohomada|api|oya|mama)\b',
2081
+ r'\b(eka|meka|thiyenne|kiyala)\b',
2082
+ r'\b(gana|genna|danna|karanna)\b',
2083
+ r'\b(budget|proposal).*\b(gana|eka)\b'
2084
+ ]
2085
+
2086
+ text_lower = text.lower()
2087
+ pattern_matches = sum(1 for pattern in sinhala_patterns if re.search(pattern, text_lower))
2088
+
2089
+ if pattern_matches >= 1: # Lower threshold for better detection
2090
+ is_romanized_sinhala = True
2091
+
2092
+ if has_sinhala_unicode:
2093
+ language_code = 'si'
2094
+ confidence = 0.9
2095
+ elif is_romanized_sinhala:
2096
+ language_code = 'singlish'
2097
+ confidence = 0.8
2098
+ else:
2099
+ language_code = 'en'
2100
+ confidence = 0.7
2101
+
2102
+ return {
2103
+ 'language': language_code,
2104
+ 'confidence': confidence,
2105
+ 'is_sinhala_unicode': has_sinhala_unicode,
2106
+ 'is_romanized_sinhala': is_romanized_sinhala,
2107
+ 'is_english': language_code == 'en',
2108
+ 'detection_method': 'enhanced_rule_based'
2109
+ }
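Review note: with the threshold at one match, a single hit on any pattern marks the input as romanized Sinhala. The sketch below reuses the four patterns from `enhanced_rule_based_detection` to show how the count behaves, including a likely false positive: the English word "API" matches `api`. `count_pattern_matches` is a hypothetical helper, and the sample sentences are illustrative.

```python
import re
from typing import List

# Patterns copied from enhanced_rule_based_detection.
SINGLISH_PATTERNS: List[str] = [
    r"\b(mokadda|kohomada|api|oya|mama)\b",
    r"\b(eka|meka|thiyenne|kiyala)\b",
    r"\b(gana|genna|danna|karanna)\b",
    r"\b(budget|proposal).*\b(gana|eka)\b",
]


def count_pattern_matches(text: str) -> int:
    """Count how many of the patterns match the lowercased text."""
    lowered = text.lower()
    return sum(1 for pattern in SINGLISH_PATTERNS if re.search(pattern, lowered))


print(count_pattern_matches("EPF eka gana mokadda thiyenne?"))
print(count_pattern_matches("What is the budget for schools?"))
print(count_pattern_matches("Is there a public API?"))  # English, but matches "api"
```

Raising the threshold to two matches, or dropping ambiguous tokens such as `api`, might reduce such false positives at the cost of some recall.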
2110
+
2111
+ def rule_based_language_detection(text: str) -> Dict[str, Any]:
2112
+ """Fallback rule-based language detection"""
2113
+ has_sinhala_unicode = detect_sinhala_content(text)
2114
+ is_romanized_sinhala = detect_singlish(text) and not has_sinhala_unicode
2115
+ is_english = not has_sinhala_unicode and not is_romanized_sinhala
2116
+
2117
+ if has_sinhala_unicode:
2118
+ language_code = 'si'
2119
+ elif is_romanized_sinhala:
2120
+ language_code = 'singlish'
2121
+ else:
2122
+ language_code = 'en'
2123
+
2124
+ return {
2125
+ 'language': language_code,
2126
+ 'confidence': 0.8, # Default confidence for rule-based
2127
+ 'is_sinhala_unicode': has_sinhala_unicode,
2128
+ 'is_romanized_sinhala': is_romanized_sinhala,
2129
+ 'is_english': is_english,
2130
+ 'detection_method': 'rule_based'
2131
+ }
2132
+
2133
  def detect_singlish(text: str) -> bool:
2134
  """Detect common Singlish patterns and words"""
2135
  singlish_words = [
 
2148
  # Consider it Singlish if it has 2 or more Singlish words
2149
  return singlish_word_count >= 2
2150
 
2151
+ def ai_transliterate_singlish_to_sinhala(text: str) -> str:
2152
+ """AI-based transliteration from Romanized Sinhala to Sinhala script"""
2153
  try:
2154
+ if sinhala_transliterator is None:
2155
+ # Fallback to rule-based transliteration
2156
+ logger.info("AI transliterator not available, using rule-based fallback")
2157
+ return rule_based_transliterate_singlish_to_sinhala(text)
2158
 
2159
+ # Use AI model for transliteration
2160
+ result = sinhala_transliterator(text, max_length=256, num_return_sequences=1)
2161
+ transliterated_text = result[0]['generated_text']
2162
 
2163
+ logger.info(f"AI transliteration: '{text}' -> '{transliterated_text}'")
2164
+ return transliterated_text
2165
+
2166
+ except Exception as e:
2167
+ logger.error(f"AI transliteration failed: {e}")
2168
+ return rule_based_transliterate_singlish_to_sinhala(text)
2169
+
2170
+ def rule_based_transliterate_singlish_to_sinhala(text: str) -> str:
2171
+ """Fallback rule-based transliteration for Romanized Sinhala"""
2172
+ try:
2173
  # Common Singlish to Sinhala mappings (simplified)
2174
  singlish_to_sinhala_map = {
2175
  'mokadda': 'මොකද්ද',
 
2218
  else:
2219
  transliterated_words.append(word) # Keep original if no mapping
2220
 
2221
+ logger.info(f"Rule-based transliteration: '{text}' -> '{' '.join(transliterated_words)}'")
2222
  return ' '.join(transliterated_words)
2223
 
2224
  except Exception as e:
2225
+ logger.error(f"Rule-based transliteration error: {e}")
2226
  return text # Return original text if transliteration fails
2227
 
2228
  def translate_text(text: str, target_language: str = 'en') -> str:
 
2236
 
2237
  def process_multilingual_input(user_message: str) -> tuple:
2238
  """
2239
+ AI-enhanced multilingual input processing:
2240
+ AI Language Detection -> AI Transliteration -> Google Translate -> English
2241
  """
 
 
2242
  processed_message = user_message
2243
  transliteration_used = False
2244
+ ai_detection_used = False
2245
+
2246
+ # Step 1: AI-based language detection
2247
+ language_info = ai_detect_language(user_message)
2248
+ original_language = language_info['language']
2249
+ confidence = language_info['confidence']
2250
+ detection_method = language_info['detection_method']
2251
+
2252
+ logger.info(f"Language detection: {original_language} (confidence: {confidence:.2f}, method: {detection_method})")
2253
 
2254
+ # Determine processing based on detected language
2255
+ if language_info['is_sinhala_unicode']:
2256
+ # Direct Sinhala Unicode -> English translation
2257
+ logger.info("Processing Sinhala Unicode input")
2258
  original_language = 'si'
2259
  needs_translation = True
2260
  processed_message = translate_text(user_message, 'en')
2261
  logger.info(f"Translated from Sinhala: '{user_message}' -> '{processed_message}'")
2262
+
2263
+ elif language_info['is_romanized_sinhala']:
2264
+ # Romanized Sinhala -> AI Transliteration -> Translation
2265
+ logger.info("Processing Romanized Sinhala (Singlish) input")
2266
  original_language = 'singlish'
2267
  needs_translation = True
2268
  transliteration_used = True
2269
+ ai_detection_used = detection_method == 'google_translate' # ai_detect_language never reports 'ai' as a method
2270
 
2271
  try:
2272
+ # Step 2: AI-based transliteration
2273
+ sinhala_text = ai_transliterate_singlish_to_sinhala(user_message)
2274
+ logger.info(f"AI transliterated: '{user_message}' -> '{sinhala_text}'")
2275
 
2276
  # Step 3: Translate Sinhala to English for search
2277
  processed_message = translate_text(sinhala_text, 'en')
2278
+ logger.info(f"Translated to English: '{sinhala_text}' -> '{processed_message}'")
2279
 
2280
  except Exception as e:
2281
+ logger.error(f"Error in AI processing pipeline: {e}")
2282
  # Fallback: try direct translation or keep original
2283
  try:
2284
  processed_message = translate_text(user_message, 'en')
2285
+ logger.info(f"Fallback translation: '{user_message}' -> '{processed_message}'")
2286
  except Exception:
2287
  processed_message = user_message
2288
  needs_translation = False
2289
  transliteration_used = False
2290
+ logger.info("Using original text for search")
2291
 
2292
+ else:
2293
+ # English, or any other language (treated as English; no translation applied)
2294
+ logger.info("Processing as English input")
2295
+ original_language = 'en'
2296
+ needs_translation = False
2297
+ processed_message = user_message
2298
+
2299
+ return processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence
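
The branching and fallback behaviour of `process_multilingual_input` above can be sketched in isolation. This is a minimal sketch, not the code in `app.py`: `InputResult` and `process_input` are hypothetical names, and the detector, transliterator and translator are passed in as plain callables so the flow can be run without the AI or translation services. It assumes the detector returns a dict with the keys read in the diff (`confidence`, `detection_method`, `is_sinhala_unicode`, `is_romanized_sinhala`). A `NamedTuple` is used for the six return values so that existing positional unpacking keeps working while fields can also be read by name.

```python
from typing import Callable, NamedTuple


class InputResult(NamedTuple):
    """Hypothetical named form of the six-value return tuple."""
    processed_message: str
    original_language: str
    needs_translation: bool
    transliteration_used: bool
    ai_detection_used: bool
    confidence: float


def process_input(
    message: str,
    detect: Callable[[str], dict],
    transliterate: Callable[[str], str],
    translate: Callable[[str], str],
) -> InputResult:
    """Mirror the three branches of the diff with injected dependencies."""
    info = detect(message)
    confidence = info["confidence"]

    if info["is_sinhala_unicode"]:
        # Sinhala Unicode: translate straight to English.
        return InputResult(translate(message), "si", True, False, False, confidence)

    if info["is_romanized_sinhala"]:
        ai_used = info["detection_method"] == "ai"
        try:
            # Romanized Sinhala: transliterate first, then translate.
            english = translate(transliterate(message))
            return InputResult(english, "singlish", True, True, ai_used, confidence)
        except Exception:
            try:
                # First fallback: translate the raw input directly.
                # As in the diff, the flags are left unchanged on this path.
                return InputResult(translate(message), "singlish", True, True, ai_used, confidence)
            except Exception:
                # Last resort: search with the original text.
                return InputResult(message, "singlish", False, False, ai_used, confidence)

    # Anything else is treated as English and passed through.
    return InputResult(message, "en", False, False, False, confidence)


# Stub dependencies for illustration only.
def fake_detect(text: str) -> dict:
    return {
        "language": "singlish",
        "confidence": 0.9,
        "detection_method": "ai",
        "is_sinhala_unicode": False,
        "is_romanized_sinhala": True,
    }


def failing_transliterate(text: str) -> str:
    raise RuntimeError("transliteration model unavailable")


def fake_translate(text: str) -> str:
    return f"EN({text})"


result = process_input("kohomada", fake_detect, failing_transliterate, fake_translate)
# Positional unpacking still works, as at the existing call sites.
processed, language, needs_translation, translit, ai_used, confidence = result
```

Passing the services in as arguments is only a convenience for the sketch; it makes the fallback path easy to exercise by substituting a callable that raises.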
2300
 
2301
  def translate_response_if_needed(response: str, original_language: str) -> str:
2302
  """Translate response back to original language if needed"""
 
2473
  """Generate response using RAG with memory and multilingual support"""
2474
  try:
2475
  # Process multilingual input
2476
+ processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence = process_multilingual_input(user_message)
2477
+ logger.info(f"Input processing: original='{user_message}', processed='{processed_message}', lang='{original_language}', transliteration='{transliteration_used}', ai_detection='{ai_detection_used}', confidence='{confidence:.2f}'")
2478
 
2479
  # Get or create memory for this session
2480
  memory = get_or_create_memory(session_id)
 
2528
  - Maintain conversation continuity
2529
  - Be culturally sensitive when discussing Sri Lankan policies
2530
  - When responding in Sinhala, use appropriate formal language for policy discussions
 
 
 
2531
 
2532
  Please provide a helpful response:"""
2533
 
 
2559
  "sources": sources,
2560
  "language_detected": original_language,
2561
  "translation_used": needs_translation,
2562
+ "transliteration_used": transliteration_used,
2563
+ "ai_detection_used": ai_detection_used,
2564
+ "detection_confidence": confidence
2565
  }
2566
 
2567
  except Exception as e:
 
2583
  "sources": [],
2584
  "language_detected": original_language if 'original_language' in locals() else 'en',
2585
  "translation_used": False,
2586
+ "transliteration_used": False,
2587
+ "ai_detection_used": False,
2588
+ "detection_confidence": 0.0
2589
  }
2590
 
2591
  def clear_session_memory(session_id: str) -> bool:
 
2627
  "user_message": user_message,
2628
  "language_detected": result.get("language_detected", "en"),
2629
  "translation_used": result.get("translation_used", False),
2630
+ "transliteration_used": result.get("transliteration_used", False),
2631
+ "ai_detection_used": result.get("ai_detection_used", False),
2632
+ "detection_confidence": result.get("detection_confidence", 0.0)
2633
  })
2634
 
2635
  except Exception as e:
 
2827
  "error": "Text is required"
2828
  }), 400
2829
 
2830
+ processed_message, original_language, needs_translation, transliteration_used, ai_detection_used, confidence = process_multilingual_input(text)
2831
 
2832
  return jsonify({
2833
  "original_text": text,
 
2835
  "language_detected": original_language,
2836
  "translation_needed": needs_translation,
2837
  "transliteration_used": transliteration_used,
2838
+ "ai_detection_used": ai_detection_used,
2839
+ "detection_confidence": confidence,
2840
  "contains_sinhala": detect_sinhala_content(text),
2841
  "is_singlish": detect_singlish(text)
2842
  })
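
The same language metadata fields are added in several places in this commit (the success response, the error response, and the chat endpoint's JSON). One possible way to keep their defaults in one place is sketched below. `LANGUAGE_METADATA_DEFAULTS` and `language_metadata` are hypothetical names that do not appear in `app.py`; the default values are the ones used by the `result.get(...)` calls in the diff.

```python
from typing import Any, Dict

# Defaults copied from the result.get(...) fallbacks in the chat endpoint.
LANGUAGE_METADATA_DEFAULTS: Dict[str, Any] = {
    "language_detected": "en",
    "translation_used": False,
    "transliteration_used": False,
    "ai_detection_used": False,
    "detection_confidence": 0.0,
}


def language_metadata(result: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the language fields out of a result dict, filling in defaults."""
    return {
        key: result.get(key, default)
        for key, default in LANGUAGE_METADATA_DEFAULTS.items()
    }


# A partial result: missing fields fall back to the defaults,
# and unrelated keys such as "sources" are not copied.
partial = language_metadata({"detection_confidence": 0.87, "sources": []})
```

An endpoint could then merge the returned dict into its response payload instead of listing each field by hand, which would keep the three call sites from drifting apart.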