CatGPT beta2

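Updates

  • Output words are now chosen by weighted random selection; a word's weight is the square of its frequency
  • Added phrase mode: a group of words that keeps appearing together is output as a bundle
  • Added punctuation (this is only an experiment)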
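
The first two items can be sketched in isolation. This is only a minimal illustration, not the code used by CatGPT: pick_weighted and dominant_index are hypothetical helper names, and std::discrete_distribution is swapped in for the listing's approach of filling a vector with repeated words. The 0.76 threshold matches the constant tied in the listing further down.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

// (word, number of times it was seen after the current word)
using Candidates = std::vector<std::pair<std::string, int>>;

// Hypothetical helper: draw one candidate with probability proportional
// to the square of its count.
std::string pick_weighted(const Candidates& candidates, std::mt19937_64& rng) {
    assert(!candidates.empty());
    std::vector<double> weights;
    weights.reserve(candidates.size());
    for (const auto& c : candidates) {
        weights.push_back(static_cast<double>(c.second) * c.second);
    }
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return candidates[dist(rng)].first;
}

// Hypothetical helper for phrase mode: if the most frequent candidate makes
// up more than `threshold` of all observations, return its index so the
// caller can emit it directly; otherwise return -1.
int dominant_index(const Candidates& candidates, double threshold) {
    long long total = 0;
    int best = -1;
    for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
        total += candidates[i].second;
        if (best < 0 || candidates[i].second > candidates[best].second) best = i;
    }
    if (best >= 0 && candidates[best].second > threshold * total) return best;
    return -1;
}
```

With counts 3 and 1 the weights are 9 and 1, so the first word is drawn about nine times out of ten. Squaring makes frequent successors dominate more strongly than a plain frequency-proportional draw would.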

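Features

  • Generates a passage from an input word (training material is currently limited, so the output runs to only about 50 words)
  • Trains itself on a piece of text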

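How it works

  • For every word, counts how often each other word follows it, accumulated across training runs, and uses those counts as weights
  • Tracks how many times each word has been used during generation (to prevent looping on the same words and similar problems)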
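
Both ideas fit in a few lines. The sketch below is an assumption-laden illustration rather than the actual implementation: count_bigrams, may_use and the BigramCounts alias are hypothetical names, and a nested std::map stands in for the listing's array of sets.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// counts[prev][next] = how many times `next` directly followed `prev`
using BigramCounts = std::map<std::string, std::map<std::string, int>>;

// Hypothetical helper: count successor frequencies in whitespace-separated text.
BigramCounts count_bigrams(const std::string& text) {
    BigramCounts counts;
    std::istringstream in(text);
    std::string prev, word;
    bool has_prev = false;
    while (in >> word) {
        if (has_prev) ++counts[prev][word];
        prev = word;
        has_prev = true;
    }
    return counts;
}

// Hypothetical helper for the usage cap: `next` may be emitted after `prev`
// only while it has been used fewer times than it was observed after `prev`.
bool may_use(const BigramCounts& counts, const std::map<std::string, int>& used,
             const std::string& prev, const std::string& next) {
    auto row = counts.find(prev);
    if (row == counts.end()) return false;
    auto cell = row->second.find(next);
    if (cell == row->second.end()) return false;
    auto u = used.find(next);
    int times_used = (u == used.end()) ? 0 : u->second;
    return times_used < cell->second;
}
```

Because every emitted word raises its usage count, each word can be chosen only a bounded number of times, which is what keeps generation from cycling forever.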

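Notes

  • Sentences cannot be generated from a word the model has never seen, so very obscure words will not work; you can, however, feed the AI a text containing the word so that it learns it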

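Usage

  • The #define TRAIN above the main function: comment it out to run test(), which generates a sentence from a word; leave it in place to self-train on the contents of the local train file
  • Make sure an info file exists. To retrain from scratch, delete everything in info, but be sure to leave a single 0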
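
The lone 0 makes sense once the layout of info is spelled out. Judging from read_info() and print_info() in the listing below, the file starts with the number of head words, and each head word is followed by its successor count and then one "successor count" pair per line. The parser here is only a sketch of that inferred layout; parse_info and the Model alias are hypothetical names.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// model[head][next] = how many times `next` followed `head`
using Model = std::map<std::string, std::map<std::string, int>>;

// Hypothetical parser for the layout inferred from read_info()/print_info():
//   <number of head words>
//   <head word> <number of successors>
//   <successor> <count>        (repeated once per successor)
// Returns false if the stream ends early or a number is malformed.
bool parse_info(std::istream& in, Model& model) {
    int heads = 0;
    if (!(in >> heads) || heads < 0) return false;
    for (int i = 0; i < heads; ++i) {
        std::string head;
        int n = 0;
        if (!(in >> head >> n) || n < 0) return false;
        for (int j = 0; j < n; ++j) {
            std::string next;
            int count = 0;
            if (!(in >> next >> count)) return false;
            model[head][next] = count;
        }
    }
    return true;
}
```

Under this layout a file containing only 0 is a valid, empty model: zero head words and nothing else to read.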

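If you would rather not think about it, here is the short version

  • Testing: comment out the #define TRAIN above the main function, then change the argument of test() in the main function; that argument is the first word of your sentence
  • Training: paste text into the train file, restore the #define TRAIN, then compile and run the program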

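Disclaimer

  • This is only an experiment; the implementation and the generated output are both rough, so treat it as entertainment and reference only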

Sample output (beta2)

cat girls are left the castle , it has a long as usual mr . b returned , said he saw cth said tao ge had completed his head , tao ge is on the castle , and looked at this time , tao ge to be honest there is not a bean tree and the tree and regulations every time , tao ge , he saw that the cat lady falls behind like a cat girls are quite peculiar 
---
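Author's translation: The cat girls all left the castle; a Mr. B as long as usual came back and said he saw cth say that Tao Ge had finished his head; Tao Ge was right on the castle, looking at this time; Tao Ge, to be honest, there is not a single bean tree, and every time there are trees and regulations; Tao Ge, he saw the cat lady fall behind like a cat girl, which is very strange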
cat lady back to the tree tao ge feels ashamed to grow in the castle, as he had a small village but also responsible for the person from the house so he quickly dodged and finally arrived at this moment , and scratched his head , and the castle needs to the castle , he had completed his hat
---
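Author's translation: The cat lady went back to the tree; Tao Ge felt ashamed to have grown up in the castle, because he had a small village but was also responsible for the people coming out of the house, so he quickly dodged and finally arrived at this moment, scratched his head, the castle needs the castle, and he had finished his hat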
cth scratched his eerie smile on the leaves , but also took off the castle , but the tree king ? hurry to the castle where tao ge was wrong strange , but he saw that the castle where tao ge , the castle manager mr . b is not right branch and the castle , it is already as much larger and doesnt look dazed and the tree , and doesnt look very smart . b is also recite praises loudly saying hi , but cth said tao ges hand holding hands high school gate like this time , they will be punished for a living how could build a big deal because he will be one bite dont know why did you can only found that read transport little cat lady this happens
---
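Author's translation: cth scratched his eerie smile on the leaves and also left the castle, but what about the tree king? Hurrying to the castle where Tao Ge was, strangely, he saw Mr. B, the manager of the castle where Tao Ge was. It is not the right-hand branch and the castle; it is already much larger and does not look so dazed, nor is it that tree, and it does not look very smart. Mr. B also loudly recited praises to say hi, but cth said Tao Ge was hand in hand at the high school gate like this time; how could they be punished with cleaning duty, because he would get bitten once; no idea why, you can only find that things like the little cat lady who reads transport happen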
huge had a walk alone until today is not enough , tao ge , and the castle this sense , he asked tao ge was his head vigorously
---
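Author's translation: huge walking alone until today is still not enough; Tao Ge, and this feeling of the castle; he vigorously asked Tao Ge whether it was his head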
cat lady this was too crowded around tao ge , and said , but also the castle , and the tree , and people who was oceansofstars , and each cat girl had never come knocking on the castle , the relationship between doujiao . tao ge used to be done ! ! ! ! i can be checked for a water well in a new day because the cat girl is not to see any pigs with a bean tree , he suddenly heard some noise ahead . at tao ge had no time , and the chef has been caused by the castle where he quickly jumped straight out of him to find me looking down regained some line segment tree , i dont have any pigs ? oh my buddy , tao ge lives on the castle manager still occasionally invites us are you want to see them from the chairman tree and asked sorry , and said that morning sunshine , and asked huge explained to the castle this moment
---
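Author's translation: This time the cat lady crowded around Tao Ge and said, and also the castle, and the tree, and who was the Star of the Ocean, and every cat girl had never come knocking on the castle, the relationship between doujiao. Tao Ge used to be done! I can check a water well on a new day, because the cat girl did not see any pigs with a bean tree; he suddenly heard a noise ahead. When Tao Ge had no time, the chef was caused by him in the castle; he quickly jumped out of him and found me looking down, having regained some segment tree; do I not have any pigs? Oh, my buddy, Tao Ge lives in the castle, and the manager still invites us occasionally; do you want to see them from the chairman tree? He apologized to us, said the sunshine was bright that morning, and explained this moment to the castle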

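Downloads

Version used for the sample output above (0.33%)

Lightly trained version (1.2084%)

The lightly trained version does not handle punctuation well

Classical Chinese poetry version

Download

Updated CatGPT version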

#include<bits/stdc++.h>
using namespace std;

namespace hdk{
	namespace Rand{
        random_device __rd;
        /**
         * @brief random number creator
         * @note there's some problem of 'device_srand()' under GCC9.3.0(Windows),
         *       if so, try 'time_srand()'
         * @param randt store the methods of the random
        */
        struct __Rand{
            mt19937_64 _Rand;
            long long Rand(){
                return _Rand();
            }
            int SystemRand(long long a,long long b){
                return (int)(std::rand()%(b-a+1)+a);
            }
            int RandSignedInt(){
                return (int)Rand();
            }
            // uniform in [l,r]; a distribution replaces the rejection loop,
            // which almost never terminated for small ranges
            int RandSignedInt(int l,int r){
                return uniform_int_distribution<int>(l,r)(_Rand);
            }
            // mask the sign bit instead of abs(), which is undefined for INT_MIN
            int RandInt(){
                return (int)(Rand()&INT_MAX);
            }
            int RandInt(int a,int b){
                return uniform_int_distribution<int>(a,b)(_Rand);
            }
            long long RandSignedLong(){
                return (long long)Rand();
            }
            long long RandSignedLong(long long l,long long r){
                return uniform_int_distribution<long long>(l,r)(_Rand);
            }
            // mask the sign bit instead of llabs(), which is undefined for LLONG_MIN
            long long RandLong(){
                return Rand()&LLONG_MAX;
            }
            // uniform in [a,b]; requires a<=b
            long long RandLong(long long a,long long b){
                return uniform_int_distribution<long long>(a,b)(_Rand);
            }
            unsigned long long device_srand(){
                unsigned long long seed=__rd();
                _Rand=mt19937_64(seed);
                return seed;
            }
            unsigned long long time_srand(){
                unsigned long long seed=time(0);
                _Rand=mt19937_64(seed);
                return seed;
            }
            void seed_srand(unsigned long long seed=time(0)){
                _Rand=mt19937_64(seed);
            }
            // random real in [0,1] with 'fixed' decimal digits
            long double RandReal(int fixed){
                long long res=1;
                for(int i=1;i<=fixed;++i) res*=10;
                long long rres=RandLong(0,res);
                return rres*1.0L/res;
            }
            // returns true with probability access_p
            bool access(double access_p){
                long long res=RandLong();
                return res<=LLONG_MAX*access_p;
            }
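            // pick a random element; A must not be empty
            // (taken by const reference to avoid copying large vectors)
            template<typename T>
            T randfrom(const vector<T>&A){
                return A[RandLong(0,(long long)A.size()-1)];
            }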
            template<typename T>
            T randfrom(T A[],int l,int r){
                return A[RandLong(l,r)];
            }
        }randt;
    }
    using namespace Rand;
}
using namespace hdk;
int store_size=0;
int output_size=0;
/**
 * @brief records how often a word appears after another word
 * @param word the following word
 * @param appear_times how many times it appeared in past training
 * each head word owns a set<vec> sorted by ascending appear_times
*/
struct vec{
    string word;
    int appear_times;
    bool operator <(const vec&A)const{
        if(appear_times==A.appear_times) return word<A.word;
        return appear_times<A.appear_times;
    }
};
set<vec>s[500001];
int cnt=0;
map<string,int>next_word;
map<string,int>appear_time[500001];
vector<string>store;
vector<string>store2;
/**
 * @brief split punctuation off the words of the training material
 * @note run this once before any 'remove_useless'
*/
void fixed_training(const string file){
    ifstream _I(file);
    store.clear();
    string h;
    // reading in the loop condition avoids processing a failed read at EOF
    while(_I>>h){
        bool flag=false;
        for(char i:h){
            if(i=='.' or i==',' or i=='?' or i=='!' or i==';'){
                store2.clear();
                store2.push_back("");
                for(int j=0;j<=(int)h.length()-1;++j){
                    if(h[j]=='.' or h[j]==',' or h[j]=='?' or h[j]=='!' or h[j]==';'){
                        string fx;fx.push_back(h[j]);
                        store2.push_back(fx);
                        store2.push_back("");
                    }
                    else store2.back().push_back(h[j]);
                }
                flag=true;
                break;
            }
        }
        if(!flag) store.push_back(h);
        else for(string i:store2) if(!i.empty()) store.push_back(i);
    }
    _I.close();
    ofstream _O(file);
    for(string i:store){
        _O<<i<<" ";
    }
}
/**
 * @brief remove every character except 'a' to 'z' and 'A' to 'Z',
 * then lowercase the result
 * @note the result may be empty; it is returned as is
 * @note punctuation must already be separated from the words,
 * which is what 'fixed_training' does
*/
string remove_useless(string x){
    if(x[0]=='.' or x[0]==',' or x[0]=='!' or x[0]=='?' or x[0]==';') return x;
    string ans;
    for(char i:x){
        if(i>='a' and i<='z'){
            ans.push_back(i);
        }
        if(i>='A' and i<='Z'){
            ans.push_back(i-'A'+'a');
        }
    }
    return ans;
}
vector<string>tot_word;
/**
 * @brief read train record from &in
*/
void read_info(ifstream &in){
    store_size=0;
    int tot;in>>tot;store_size=tot;
    while(tot--){
        string x,y;int n,t;in>>x>>n;
        x=remove_useless(x);
        if(!next_word.count(x)){
            next_word[x]=++cnt;
        }
        int tmp=next_word[x];
        while(n--){
            store_size++;
            in>>y>>t;
            y=remove_useless(y);
            appear_time[tmp][y]=t;
            s[tmp].insert({y,t});
        }
    }
    for(auto i:next_word){
        tot_word.push_back(i.first);
    }
}
#define randword randt.randfrom(tot_word)
/**
 * @brief print new train record to &out
*/
void print_info(ofstream &out){
    out<<next_word.size()<<endl;
    output_size=(int)next_word.size();
    for(auto i:next_word){
        out<<i.first<<" "<<s[i.second].size()<<endl;
        for(auto j:s[i.second]){
            output_size++;
            out<<j.word<<" "<<j.appear_times<<endl;
        }
    }
}
/**
 * @brief train on the text material read from &in
 * @note make sure an English text is waiting in the stream
*/
void train(ifstream &in){
    string x,last; // an empty 'last' means there is no previous word yet
    // reading in the loop condition keeps the final word from being counted twice
    while(in>>x){
        x=remove_useless(x);
        if(x.empty()) continue;
        if(!last.empty()){
            if(!next_word.count(last)){
                next_word[last]=++cnt;
            }
            int tmp=next_word[last];
            if(!appear_time[tmp].count(x)){
                appear_time[tmp][x]=1;
                s[tmp].insert({x,1});
            }
            else{
                auto iter=s[tmp].lower_bound({x,appear_time[tmp][x]});
                auto st=*iter;s[tmp].erase(iter);
                s[tmp].insert({st.word,st.appear_times+1});
                appear_time[tmp][x]+=1;
            }
        }
        last=x;
    }
}
map<string,int>mp;
vector<string>wating_word;
/**
 * @note phrase mode: if one successor makes up more than 'tied' of all
 * occurrences, it is tied to the current word and printed directly
*/
const long double tied=0.76;
/**
 * @brief decide which word follows x and print it
 * @note stops when nothing can be printed, otherwise continues
 * with the chosen word automatically
 * @note random weight: each candidate is drawn with a weight equal
 * to the square of its appearance count
*/
void test(string x){
    x=remove_useless(x);
    if(x.empty()) return;
    cout<<x<<" ";
    if(!next_word.count(x)) return;
    int tmp=next_word[x];
    if(s[tmp].empty()) return;
    wating_word.clear();
    long long totcnt=0;
    // the set is sorted by ascending count, so the most frequent word is last
    const vec best=*s[tmp].rbegin();
    for(const auto &i:s[tmp]){
        totcnt+=i.appear_times;
        long long weight=1LL*i.appear_times*i.appear_times;
        for(long long j=1;j<=weight;++j){
            wating_word.push_back(i.word);
        }
    }
    // the usage cap also applies here, otherwise two tied words could recurse forever
    if(best.appear_times*1.0L/totcnt>tied and x!=best.word and mp[best.word]<best.appear_times){
        mp[best.word]++;
        test(best.word);
        return;
    }
    string lt=randt.randfrom(wating_word);int tries=0;
    while(mp.count(lt) and mp[lt]>=appear_time[tmp][lt]){
        if(tries>=20) return;
        tries++;lt=randt.randfrom(wating_word);
    }
    mp[lt]++;
    test(lt);
}
/**
 * #define TRAIN to turn on train mode:
 * CatGPT then learns from the file 'train'
 * if TRAIN is not defined, test mode runs instead
*/
#define TRAIN
int main(){
    #if RAND_MAX==INT_MAX
    randt.device_srand();
    #else
    randt.time_srand();
    #endif
    ifstream _I("info");
    read_info(_I);
    _I.close(); // release 'info' before train mode rewrites it
    #ifndef TRAIN
    //remember to change the first word passed to test()
    if(tot_word.empty()){
        cout<<"info holds no words yet, train first"<<endl;
        return 0;
    }
    test(randword);
    #else
    cout<<"Read Finished -> ";
    fixed_training("train");
    ifstream _I2("train");
    train(_I2);
    cout<<"Train Finished"<<endl;
    ofstream _O("info");
    print_info(_O);
    cout<<"Update: "<<output_size-store_size<<" words"<<endl;
    cout<<"Now: "<<output_size<<" words"<<endl;
    cout<<"Already Used Memory: "<<(int)(next_word.size())*1.0/5000<<"% "<<endl;
    #endif
}
posted @ 2024-10-19 19:47  HaneDaniko