| 2026-03-01 15:54 | eval_success | Evaluated: Neutral (-0.01) | - - |
| 2026-03-01 15:54 | model_divergence | Cross-model spread 0.50 exceeds threshold (2 models) | - - |
| 2026-03-01 15:54 | eval | Evaluated by deepseek-v3.2: -0.01 (Neutral) 14,573 tokens -0.04 | |
| 2026-03-01 15:54 | rater_validation_warn | Validation warnings for model deepseek-v3.2: 0W 2R | - - |
| 2026-02-28 16:08 | eval_success | Lite evaluated: Mild negative (-0.20) | - - |
| 2026-02-28 16:08 | model_divergence | Cross-model spread 0.85 exceeds threshold (4 models) | - - |
| 2026-02-28 16:08 | eval | Evaluated by llama-4-scout-wai: -0.20 (Mild negative) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 13:46 | eval_success | Evaluated: Neutral (0.03) | - - |
| 2026-02-28 13:46 | model_divergence | Cross-model spread 0.85 exceeds threshold (4 models) | - - |
| 2026-02-28 13:46 | eval | Evaluated by deepseek-v3.2: +0.03 (Neutral) 14,968 tokens -0.11 | |
| 2026-02-28 13:46 | rater_validation_warn | Validation warnings for model deepseek-v3.2: 1W 0R | - - |
| 2026-02-28 13:15 | model_divergence | Cross-model spread 0.85 exceeds threshold (3 models) | - - |
| 2026-02-28 13:15 | eval_success | Lite evaluated: Mild negative (-0.20) | - - |
| 2026-02-28 13:15 | eval | Evaluated by llama-4-scout-wai: -0.20 (Mild negative) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 13:15 | rater_validation_warn | Lite validation warnings for model llama-4-scout-wai: 0W 1R | - - |
| 2026-02-28 10:28 | model_divergence | Cross-model spread 0.85 exceeds threshold (3 models) | - - |
| 2026-02-28 10:28 | eval_success | Lite evaluated: Mild negative (-0.20) | - - |
| 2026-02-28 10:28 | eval | Evaluated by llama-4-scout-wai: -0.20 (Mild negative) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 10:28 | rater_validation_warn | Lite validation warnings for model llama-4-scout-wai: 0W 1R | - - |
| 2026-02-28 09:31 | model_divergence | Cross-model spread 0.85 exceeds threshold (3 models) | - - |
| 2026-02-28 09:31 | eval_success | Light evaluated: Mild negative (-0.20) | - - |
| 2026-02-28 09:31 | eval | Evaluated by llama-4-scout-wai: -0.20 (Mild negative) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 09:31 | rater_validation_warn | Light validation warnings for model llama-4-scout-wai: 0W 1R | - - |
| 2026-02-28 08:02 | model_divergence | Cross-model spread 0.85 exceeds threshold (3 models) | - - |
| 2026-02-28 08:02 | eval_success | Light evaluated: Moderate positive (0.30) | - - |
| 2026-02-28 08:02 | eval | Evaluated by llama-3.3-70b-wai: +0.30 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 08:02 | rater_validation_warn | Light validation warnings for model llama-3.3-70b-wai: 0W 1R | - - |
| 2026-02-28 07:46 | eval | Evaluated by llama-3.3-70b-wai: +0.30 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 07:33 | eval | Evaluated by llama-4-scout-wai: -0.20 (Mild negative) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 06:42 | eval | Evaluated by llama-3.3-70b-wai: +0.30 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 06:13 | eval | Evaluated by llama-3.3-70b-wai: +0.30 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 06:06 | eval | Evaluated by llama-3.3-70b-wai: +0.30 (Moderate positive) -0.20 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 05:17 | eval | Evaluated by llama-4-scout-wai: -0.20 (Mild negative) -0.20 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 05:00 | eval | Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 04:53 | eval | Evaluated by llama-3.3-70b-wai: +0.50 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 04:51 | eval | Evaluated by llama-4-scout-wai: 0.00 (Neutral) -0.20 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 04:19 | eval | Evaluated by llama-3.3-70b-wai: +0.50 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 03:46 | eval | Evaluated by llama-3.3-70b-wai: +0.50 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 03:23 | eval | Evaluated by llama-3.3-70b-wai: +0.50 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 03:17 | eval | Evaluated by llama-3.3-70b-wai: +0.50 (Moderate positive) 0.00 | |
| reasoning Exposing privacy abuse |
| 2026-02-28 02:57 | eval | Evaluated by llama-4-scout-wai: +0.20 (Mild positive) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 02:55 | eval | Evaluated by llama-4-scout-wai: +0.20 (Mild positive) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 02:35 | eval | Evaluated by llama-4-scout-wai: +0.20 (Mild positive) 0.00 | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 02:29 | eval | Evaluated by llama-4-scout-wai: +0.20 (Mild positive) | |
| reasoning ED, slightly negative lean on user data handling |
| 2026-02-28 02:06 | eval | Evaluated by llama-3.3-70b-wai: +0.50 (Moderate positive) | |
| reasoning Exposing privacy abuse |
| 2026-02-28 01:18 | eval | Evaluated by claude-haiku-4-5: +0.65 (Strong positive) | |
| 2026-02-27 00:03 | eval | Evaluated by deepseek-v3.2: +0.14 (Mild positive) 14,007 tokens | |
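
The model_divergence rows report a "cross-model spread" without saying how it is computed. One reading consistent with the 2026-02-28 16:08 entry (spread 0.85, 4 models) is the maximum minus the minimum of each model's most recent score. The sketch below assumes that definition. The function name and the threshold value are hypothetical; the log only shows that a spread of 0.50 already exceeds the real threshold.

```python
def cross_model_spread(latest_scores):
    """Return max minus min over the most recent score of each model.

    Assumed definition: the log does not state how the spread is derived.
    """
    values = list(latest_scores.values())
    return max(values) - min(values)


# Most recent score per model as of 2026-02-28 16:08, read from the log above.
latest = {
    "claude-haiku-4-5": 0.65,
    "deepseek-v3.2": 0.03,
    "llama-3.3-70b-wai": 0.30,
    "llama-4-scout-wai": -0.20,
}

# Hypothetical threshold; the real value is not given in the log.
SPREAD_THRESHOLD = 0.40

spread = round(cross_model_spread(latest), 2)
diverged = spread > SPREAD_THRESHOLD
print(f"spread={spread:.2f} diverged={diverged}")
```

Under this assumption the four scores give 0.65 - (-0.20) = 0.85, which matches the logged value for that timestamp. The 2-model spread of 0.50 on 2026-03-01 does not follow from the same four inputs, so the set of models included in each check is evidently chosen by some rule the log does not show.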