gx-ai-architect committed on
Commit
d8aed39
1 Parent(s): 3be16ca

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -72,7 +72,7 @@ The prompts space for preference tuning were uniformly sampled by source from th
 
  The preference tuned version of Merlinite-7B-pt shows overall all performance enhancement across the board, with no alignment tax observed, as shown in our evaluation. Surprisingly, we find improvements in mathematical ability measured by GSM8K and MT-Bench, which differs from studies observing decreased math/reasoning after RLHF alignment.
 
- We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as dhown in chart above. The reward score of Best-of-N sampled batch keeps improving til Rejection Sampling Round-2. Model saturates at Rejection sampling round 3, no longer gives improvements on either MT-Bench nor Mixtral-DPO rewards.
+ We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as shown in chart above. The reward score of Best-of-N sampled batch keeps improving til Rejection Sampling Round-2. Model saturates at Rejection sampling round 3, no longer giving improvements on either MT-Bench or Mixtral-DPO rewards.
 
  The final Merlinite-7B-pt is the peak checkpoint measured by both Batch-Reward and MT-Bench.
 