data.generation.realistic
¶
afccp.data.generation.realistic
¶
Generates realistic AFCCP problem instances by learning from historical data and sampling with a conditional tabular GAN (CTGAN). This module prepares training datasets from past instances, trains/loads CTGAN models, samples synthetic cadets and AFSC utilities, and can augment an existing instance (e.g., 2026) with OTS candidates. It also rebuilds the derived AFCCP parameter structures needed by optimization models (preferences, utilities, eligibility/qual matrices, quotas, and rated OM datasets).
What this module does¶
- AFSC facts & proportions (for policy generation)
  - `process_instances_into_afscs_data`: Builds an `afscs_data.csv` with AFSC list, accession groups, “all-eligible” flags, USAFA/ROTC proportions, overall PGL proportions, and degree-tier strings.
- CTGAN training data assembly
  - `process_instances_into_ctgan_data`: Merges selected years (e.g., 2024/2025) into a single table of features (SOC, CIP1/2, Merit, least‑desired AFSCs) plus per‑AFSC cadet/AFSC utilities.
  - Handles 2024 column harmonization (e.g., `13S1S → USSF_{R/U}`, `11U → 18X`) and merges SOC‑segmented AFSCs into generic columns via `fix_soc_afscs_to_generic`.
- Model training
  - `train_ctgan`: Detects metadata with SDV, enforces [0,1] constraints on Merit and all utility columns, trains the CTGAN, and saves to `<support>/CTGAN_*.pkl`.
- Sampling realistic instances
  - `generate_ctgan_instance`: Loads a trained CTGAN and samples N cadets; optionally conditions on pilot first‑choice composition (USAFA/ROTC) and sets degree qualification style (`degree_qual_type`).
  - Re‑scales OM percentiles within SOC, builds cadet preference lists/matrices, AFSC utilities, and rated OM datasets (USAFA/ROTC) consistent with AFCCP expectations.
- OTS augmentation pipeline
  - `augment_2026_data_with_ots`: Adds a large OTS cohort to an existing instance (e.g., `2026_0`) by sampling with CTGAN and stitching the new cadets into all required CSVs (Cadets, Preferences, Utility, Selected, AFSCs Preferences, Rated OM, etc.).
  - Degree‑scarce AFSCs are boosted with targeted sampling: `generate_data_with_degree_preference_fixes` → `extract_afsc_cip_sampling_information` → `sample_cadets_for_degree_conditions` (+ KDE‑based utility samplers).
  - Recomputes OM and AFSC rankings for OTS (`re_calculate_ots_om_and_afsc_rankings`), aligns volunteer flags and degrees (`align_ots_preferences_and_degrees_somewhat`), rebuilds qual matrices and utilities with eligibility rules (`construct_parameter_dictionary_and_augment_data`), and emits fully formed dataframes via `construct_full_afsc_preferences_data`, `construct_full_cadets_data`, and `compile_new_dataframes`.
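The merge of SOC‑segmented AFSC columns into generic ones can be pictured with a small pandas sketch. This is not the body of `fix_soc_afscs_to_generic`: the helper name `merge_soc_columns`, the `SOC` column values, and the rule "take the `_U` value for USAFA cadets and the `_R` value otherwise" are all assumptions made for illustration.

```python
import pandas as pd

def merge_soc_columns(df: pd.DataFrame, generic: str, soc_col: str = "SOC") -> pd.DataFrame:
    """Collapse '<generic>_R' and '<generic>_U' into one '<generic>' column.

    Hypothetical rule: each cadet keeps the value from the column that
    matches their own source of commissioning.
    """
    rotc_col, usafa_col = f"{generic}_R", f"{generic}_U"
    is_usafa = df[soc_col].str.upper().eq("USAFA")
    out = df.copy()
    # Keep the USAFA column where the cadet is USAFA, otherwise the ROTC column.
    out[generic] = out[usafa_col].where(is_usafa, out[rotc_col])
    return out.drop(columns=[rotc_col, usafa_col])

demo = pd.DataFrame({
    "SOC": ["USAFA", "ROTC"],
    "11XX_R": [0.1, 0.9],
    "11XX_U": [0.8, 0.2],
})
print(merge_soc_columns(demo, "11XX"))
```

The real function also has to carry the least‑desired AFSC columns through unchanged, which this sketch does not touch.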
Key outputs & file layout¶
- Writes training/derived data under `<support>/data/`:
  - `afscs_data.csv` (AFSC facts/proportions)
  - `ctgan_data.csv` (CTGAN training table)
- Writes a trained model under `<support>/CTGAN_*.pkl`.
- For instance augmentation, writes CSVs under `instances/<export_name>/4. Model Input/`.
Important details & conventions¶
- SOC merging: `fix_soc_afscs_to_generic` consolidates `11XX_{R/U}` → `11XX` and `USSF_{R/U}` → `USSF` for training and downstream sampling while preserving least‑desired columns.
- Bounds: Merit and all utility columns are constrained to `[0,1]` during CTGAN training.
- OM re-scaling: Within‑SOC percentile normalization ensures comparable distributions for USAFA/ROTC.
- Eligibility coupling: AFSC utilities are zeroed for ineligible/non‑volunteer cases; missing but eligible entries may be backfilled with OM (rated/USSF and NRL logic differs accordingly).
- Quotas: PGL and SOC quotas are sampled from empirical proportions stored in `afscs_data.csv`, then propagated to `quota_*` parameters.
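Within‑SOC percentile normalization can be illustrated as a grouped percentile rank. The sketch below is not the module's code; the column names `SOC` and `Merit` and the helper name `rescale_om_within_soc` are assumptions.

```python
import pandas as pd

def rescale_om_within_soc(cadets: pd.DataFrame,
                          soc_col: str = "SOC",
                          om_col: str = "Merit") -> pd.Series:
    """Re-express OM as a percentile rank inside each SOC group.

    Hypothetical helper: every group ends up spanning (0, 1], no matter
    how its raw scores were distributed.
    """
    return cadets.groupby(soc_col)[om_col].rank(pct=True, method="average")

cadets = pd.DataFrame({
    "SOC": ["USAFA", "USAFA", "ROTC", "ROTC", "ROTC", "ROTC"],
    "Merit": [0.90, 0.95, 0.10, 0.20, 0.30, 0.40],
})
cadets["Merit"] = rescale_om_within_soc(cadets)
print(cadets)
```

After the call, both groups top out at 1.0 even though the raw USAFA scores were clustered high and the raw ROTC scores were clustered low.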
Minimal examples¶
- Train a model:

      >>> process_instances_into_ctgan_data(['2024','2025'])
      >>> train_ctgan(epochs=1000, name='CTGAN_Full')

- Sample a synthetic instance:

      >>> p = generate_ctgan_instance(N=1600, name='CTGAN_Full', pilot_condition=True, degree_qual_type='Consistent')

- Augment 2026 with OTS:

      >>> augment_2026_data_with_ots(N=3000, import_name='2026_0', export_name='2026O')
Dependencies¶
- SDV (`sdv`): `CTGANSynthesizer`, `SingleTableMetadata`, `Condition`
- NumPy, Pandas, SciPy (`gaussian_kde`) for sampling and table ops
- AFCCP submodules: `globals`, `data.adjustments`, `data.preferences`, `data.values`, `data.support`
train_ctgan(epochs=1000, printing=True, name='CTGAN_Full')
¶
Trains a CTGAN to produce realistic data from the current "ctgan_data" file in the support sub-folder, then saves the fitted model as a ".pkl" file back to the support sub-folder.
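A rough picture of the training step, assuming the SDV 1.x single-table API with dictionary-style constraints (newer SDV releases may expose constraints differently). The names `unit_interval_constraints` and `train_ctgan_sketch` are hypothetical, and the real function resolves its own paths and column lists.

```python
def unit_interval_constraints(columns):
    """Build one SDV ScalarRange constraint per column, bounding it to [0, 1]."""
    return [
        {
            "constraint_class": "ScalarRange",
            "constraint_parameters": {
                "column_name": col,
                "low_value": 0.0,
                "high_value": 1.0,
                "strict_boundaries": False,  # exactly 0 and exactly 1 are allowed
            },
        }
        for col in columns
    ]

def train_ctgan_sketch(data, bounded_cols, out_path, epochs=1000):
    """Detect metadata, add [0, 1] bounds, fit a CTGAN, and save it to out_path."""
    # Imported lazily so the helper above works without SDV installed.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import CTGANSynthesizer

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data)

    model = CTGANSynthesizer(metadata, epochs=epochs, verbose=True)
    model.add_constraints(constraints=unit_interval_constraints(bounded_cols))
    model.fit(data)
    model.save(out_path)
    return model

print(unit_interval_constraints(["Merit"]))
```

In this sketch `bounded_cols` would hold `Merit` plus every utility column; how those columns are selected in the actual module is not shown here.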
Source code in afccp/data/generation/realistic.py
generate_ctgan_instance(N=1600, name='CTGAN_Full', pilot_condition=False, degree_qual_type='Consistent')
¶
Generates a representative problem instance with the specified number of cadets by sampling from a CTGAN that has been trained on a real class year of cadets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pilot_condition | | Whether to sample cadets according to pilot preferences (makes the sample more representative) | False |
name | | Name of the CTGAN model to import | 'CTGAN_Full' |
N | | Number of cadets to generate | 1600 |
Returns:
Type | Description |
---|---|
| | Model fixed parameters |
augment_2026_data_with_ots(N: int = 3000, import_name: str = '2026_0', export_name: str = '2026O')
¶
Augment a base instance with a synthetic OTS cohort and export a new, fully wired instance.
This pipeline loads the trained CTGAN model and historical CTGAN training table, samples N
realistic cadets (with extra emphasis on degree‑scarce AFSCs), converts them to OTS,
re-computes OM and AFSC utilities under AFCCP rules (eligibility & volunteer logic), and
stitches the new cohort into all downstream CSVs (Cadets, Preferences, Utilities, AFSC
Preferences, CASTLE input, etc.). The result is written to `instances/{export_name}/4. Model Input/`.
Parameters¶
N : int, optional
    Number of OTS cadets to generate (default 3000).
import_name : str, optional
    Name of the source instance to copy/extend (e.g., `'2026_0'`).
    Reads input CSVs from `instances/{import_name}/4. Model Input/`.
export_name : str, optional
    Name of the destination instance to create (e.g., `'2026O'`).
    Writes outputs to `instances/{export_name}/4. Model Input/`.
Workflow¶
1) Load CTGAN training data (`<support>/data/ctgan_data.csv`) and AFSCs for the source instance.
2) Load CTGAN model (`<support>/CTGAN_Full.pkl`).
3) Targeted sampling for degree‑scarce AFSCs via `generate_data_with_degree_preference_fixes` (with KDE utility bootstrapping), then sample the remainder from the CTGAN.
4) Force SOC to `OTS`, re‑scale OM and blend AFSC utilities with OM / cadet utility using `re_calculate_ots_om_and_afsc_rankings`.
5) Align volunteers and degree fields for OTS with `align_ots_preferences_and_degrees_somewhat` (USSF turned off for OTS).
6) Build AFCCP parameter dict and eligibility‑aware AFSC utilities with `construct_parameter_dictionary_and_augment_data` (zero for ineligible/non‑volunteer; OM backfill where appropriate).
7) Rebuild AFSC preference rankings and matrices with `construct_full_afsc_preferences_data`, and cadet‑side preferences/utilities with `construct_full_cadets_data`.
8) Merge everything with existing source CSVs via `compile_new_dataframes` and export.
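The eligibility rule in step 6 (zero for ineligible or non‑volunteer pairs, OM backfill for missing entries) can be sketched with NumPy. This is an illustration under assumptions, not `construct_parameter_dictionary_and_augment_data` itself: the helper name, the array layout (cadets by AFSCs), and the use of NaN for "missing" are hypothetical, and the sketch ignores the differing rated/USSF and NRL logic.

```python
import numpy as np

def eligibility_aware_utility(afsc_utility, eligible, volunteer, om):
    """Zero out disallowed cadet/AFSC pairs and backfill gaps with OM.

    afsc_utility : (N, M) float array, NaN where no utility is recorded
    eligible     : (N, M) bool array
    volunteer    : (N, M) bool array
    om           : (N,) float array of order-of-merit percentiles
    """
    afsc_utility = np.asarray(afsc_utility, dtype=float)
    om = np.asarray(om, dtype=float)
    allowed = np.asarray(eligible, dtype=bool) & np.asarray(volunteer, dtype=bool)
    # Missing entries take the cadet's OM, broadcast across the AFSC axis.
    filled = np.where(np.isnan(afsc_utility), om[:, None], afsc_utility)
    # Any pair that is ineligible or non-volunteer gets utility 0.
    return np.where(allowed, filled, 0.0)

utility = np.array([[0.8, np.nan],
                    [np.nan, 0.4]])
eligible = np.array([[True, True], [False, True]])
volunteer = np.array([[True, True], [True, False]])
om = np.array([0.7, 0.2])
print(eligibility_aware_utility(utility, eligible, volunteer, om))
```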
Files Read¶
- `<support>/data/ctgan_data.csv`
- `<support>/CTGAN_Full.pkl`
- `instances/{import_name}/4. Model Input/{import_name} AFSCs.csv`
- `instances/{import_name}/4. Model Input/{import_name} AFSCs Preferences.csv`
- `instances/{import_name}/4. Model Input/{import_name} Cadets.csv`
- `instances/{import_name}/4. Model Input/{import_name} Castle Input.csv`
Files Written (to `instances/{export_name}/4. Model Input/`)¶
- `{export_name} Cadets.csv`
- `{export_name} AFSCs Preferences.csv`
- `{export_name} AFSCs.csv` (copied base AFSCs, unchanged schema)
- `{export_name} Raw Data.csv` (the assembled OTS sampling table)
- `{export_name} Castle Input.csv`
- Plus augmented matrices produced by `compile_new_dataframes` (e.g., Cadets Preferences, Cadets Utility, Cadets Selected, AFSCs Buckets, OTS Rated OM).
Returns¶
None
    Side‑effects only. Progress is printed to stdout; artifacts are saved to disk.
Notes¶
- Assumes the CTGAN model is saved as `<support>/CTGAN_Full.pkl`.
- Assumes the source instance (`import_name`) contains the standard AFCCP CSVs under `4. Model Input/` with 2026‑style schemas.
- OTS candidates are excluded from USSF by construction (`USSF Vol = False`, utilities set to 0).
- Rated OM for OTS is derived from OM where needed, filtered by eligibility.
Examples¶
augment_2026_data_with_ots(N=3000, import_name='2026_0', export_name='2026O')
generate_data_with_degree_preference_fixes(model, full_data, afscs_df)
¶
Generate synthetic cadet data for rare AFSCs, preserving degree distribution preferences and realistic cadet/AFSC utilities.
This function focuses on AFSCs that are difficult to fill ("rare" AFSCs), generating synthetic cadets in a way that matches observed degree patterns (CIP1) from historical data. It uses `extract_afsc_cip_sampling_information` to determine sampling quotas and conditions, and `sample_cadets_for_degree_conditions` to produce matching synthetic records. Cadet and AFSC utility values are then resampled for realism.
Parameters¶
model : object
    A generative model instance (e.g., CTGAN) implementing `sample_from_conditions(conditions)` to produce synthetic cadets.
full_data : pandas.DataFrame
    Full dataset containing historical cadet and AFSC information.
afscs_df : pandas.DataFrame
    DataFrame containing AFSC metadata, including 'OTS Target' values.
Returns¶
pandas.DataFrame
    Synthetic dataset containing cadets for rare AFSCs with realistic degree distributions and utility values.
Notes¶
- Rare AFSC eligibility is hardcoded as a list of AFSC strings in this function.
- Degree sampling is biased toward more common CIPs by cubic weighting (proportions ∝ frequency³).
- Cadet and AFSC utilities are drawn from kernel density estimators (KDEs) fitted on historical data.
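A KDE‑based utility sampler of the kind mentioned in the notes can be sketched with SciPy's `gaussian_kde`. The helper name `make_utility_sampler` and the clipping of draws to [0, 1] are assumptions for illustration; the module may handle out‑of‑range draws differently.

```python
import numpy as np
from scipy.stats import gaussian_kde

def make_utility_sampler(historical, seed=None):
    """Fit a 1-D KDE on historical utilities and return a sampling function."""
    kde = gaussian_kde(np.asarray(historical, dtype=float))
    rng = np.random.default_rng(seed)

    def sample(n):
        # resample returns shape (dimensions, n); keep the single dimension.
        draws = kde.resample(n, seed=rng)[0]
        # Assumed post-processing: utilities must stay inside [0, 1].
        return np.clip(draws, 0.0, 1.0)

    return sample

sampler = make_utility_sampler([0.60, 0.70, 0.80, 0.90, 0.95], seed=0)
print(sampler(5))
```

One sampler per AFSC (or per AFSC‑degree pair) can then be stored in a dictionary and called whenever a synthetic cadet needs a utility value.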
extract_afsc_cip_sampling_information(full_data, afscs_rare_eligible, afscs_rare_df)
¶
Extract degree distribution and utility sampling information for rare AFSCs.
This function identifies cadets who have strong mutual preference with specific rare AFSCs (both the AFSC ranks the cadet highly and the cadet ranks the AFSC highly), determines the distribution of primary degrees (CIP1) for those cadets, and constructs constraints to ensure proportional representation in generated synthetic data. It also fits kernel density estimators (KDEs) to model cadet and AFSC utility scores for each AFSC-degree combination.
Parameters¶
full_data : pandas.DataFrame
    Full dataset containing cadet records with columns for degree codes (`CIP1`), AFSC utilities (`<AFSC>_AFSC`), and cadet preferences (`<AFSC>_Cadet`).
afscs_rare_eligible : list of str
    List of AFSC codes considered rare and eligible for targeted sampling.
afscs_rare_df : pandas.DataFrame or pandas.Series
    Data structure mapping each AFSC to the number of cadets needed to meet quotas for that AFSC.
Returns¶
total_gen : int
    Total number of synthetic cadets to generate across all rare AFSCs.
afsc_cip_data : dict
    Mapping of `{afsc: pandas.Series}` where the Series index is degree codes (CIP1) and values are the number of cadets to generate for each degree.
afsc_cip_conditions : dict
    Mapping `{afsc: {cip: Condition}}` specifying generation constraints for each AFSC-degree combination.
afsc_util_samplers : dict
    Mapping `{afsc: callable}` returning AFSC utility samples for a given AFSC.
cadet_util_samplers : dict
    Mapping `{afsc: callable}` returning cadet utility samples for a given AFSC.
Notes¶
- Only cadets with mutual interest scores > 0.6 for a given AFSC are considered.
- Degree frequencies are cubed to overweight common degrees, then scaled to match target generation counts using `safe_round`.
- For AFSC `62EXE`, target counts are halved due to quota filling difficulty.
- Generation quotas are inflated by 40% or at least 3 extra cadets to ensure adequate representation.
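The cubic weighting and quota inflation described in the notes can be sketched as follows. Two assumptions are made loudly here: "inflated by 40% or at least 3 extra cadets" is read as "add whichever is larger, 40% of the target or 3 cadets", and a simple largest‑remainder allocation stands in for `safe_round`. Both helper names are hypothetical.

```python
import math
import pandas as pd

def inflate_target(target: int) -> int:
    """Assumed reading: add the larger of 40% of the target or 3 cadets."""
    return target + max(3, math.ceil(2 * target / 5))

def cubic_degree_quota(cip_counts: pd.Series, n_generate: int) -> pd.Series:
    """Split n_generate across degrees with weights proportional to count**3."""
    weights = cip_counts.astype(float) ** 3
    raw = weights / weights.sum() * n_generate
    base = raw.apply(math.floor).astype(int)
    # Hand leftover units to the largest fractional parts so the sum is exact.
    leftover = n_generate - int(base.sum())
    order = (raw - base).sort_values(ascending=False, kind="stable").index
    base.loc[order[:leftover]] += 1
    return base

# Hypothetical historical degree counts for one rare AFSC.
cip_counts = pd.Series({"14.0901": 6, "11.0701": 3, "27.0101": 1})
n_generate = inflate_target(10)
quota = cubic_degree_quota(cip_counts, n_generate)
print(n_generate, quota.to_dict())
```

Cubing pushes nearly all of the quota onto the most common degree: a 6:3:1 split in the history becomes roughly 216:27:1 in the weights.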
safe_round(data, decimals: int = 0, axis: int = -1)
¶
Round values while preserving the sum along a given axis.
This function rounds `data` to `decimals` decimal places but adjusts a minimal subset of elements so that the rounded values sum to the same (rounded) total as the original, slice‑by‑slice along `axis`. It does this by distributing the leftover rounding “units” to the entries whose fractional parts are most favorable (largest magnitude residuals with the correct sign), using a stable tie‑break so results are deterministic.
Parameters¶
data : numpy.ndarray
    Input array to round. Must be numeric. (Other array‑likes are coerced; behavior is only guaranteed for NumPy arrays.)
decimals : int, optional
    Number of decimal places to keep (default 0).
axis : int, optional
    Axis along which to preserve the slice sums (default -1). Each 1D slice along this axis will have its rounded sum equal to the original sum rounded to `decimals`.
Returns¶
numpy.ndarray or same type as `data` when feasible
    Rounded array with the same shape as `data`. If `data` is a NumPy array, a NumPy array is returned. For some other types, the function attempts to reconstruct the input type after rounding.
Notes¶
- Let `S = sum(data, axis)` and `S_r = round(S, decimals)`. The output `y` satisfies `sum(y, axis) == S_r` exactly (up to floating‑point representation).
- Within each slice, the adjustment is minimal in the sense that only the elements with the largest compatible residuals are modified by ± one unit in the scaled space (10**decimals).
- Time complexity is `O(n log n)` per slice due to sorting; memory usage is linear in the slice size.
- This procedure does not enforce monotonicity or ordering of values.
Examples¶
    >>> import numpy as np
    >>> x = np.array([0.24, 0.24, 0.24, 0.24, 0.04])
    >>> x.sum(), round(x.sum(), 2)
    (1.0, 1.0)
    >>> y = safe_round(x, decimals=1, axis=0)
    >>> y
    array([0.2, 0.2, 0.2, 0.2, 0.2])
    >>> y.sum()
    1.0

    >>> X = np.array([[0.333, 0.333, 0.334],
    ...               [0.125, 0.125, 0.750]])
    >>> Y = safe_round(X, decimals=2, axis=1)
    >>> Y
    array([[0.33, 0.33, 0.34],
           [0.12, 0.13, 0.75]])
    >>> Y.sum(axis=1)
    array([1. , 1. ])
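The idea behind sum‑preserving rounding can be shown with a one‑dimensional largest‑residual sketch. It is a simplified stand‑in, not the `safe_round` implementation: it handles a single 1‑D slice only, and the name `sum_preserving_round` is hypothetical.

```python
import numpy as np

def sum_preserving_round(values, decimals=0):
    """Round a 1-D array so the result sums to round(sum(values), decimals)."""
    scale = 10 ** decimals
    scaled = np.asarray(values, dtype=float) * scale
    rounded = np.round(scaled)
    # Whole units still owed (positive) or over-allocated (negative).
    deficit = int(np.round(scaled.sum()) - rounded.sum())
    if deficit != 0:
        sign = 1 if deficit > 0 else -1
        # Residuals oriented so that "largest" means "most deserving of a unit".
        residual = (scaled - rounded) * sign
        # Stable sort keeps the earliest element first among equal residuals.
        picks = np.argsort(-residual, kind="stable")[:abs(deficit)]
        rounded[picks] += sign
    return rounded / scale

x = np.array([0.313, 0.322, 0.365])
y = sum_preserving_round(x, decimals=2)
print(y, y.sum())
```

Plain rounding of `x` to two decimals would lose one hundredth of the total; the sketch gives that unit to the last element, which has the largest residual.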
sample_cadets_for_degree_conditions(model, total_gen, afscs_rare_eligible, afsc_cip_data, afsc_cip_conditions)
¶
Generate synthetic cadets matching AFSC-degree sampling conditions.
Iterates over rare AFSCs and their associated degree quotas to generate synthetic cadets using the provided generative model. For each AFSC-degree combination, the function samples cadets that meet the degree condition constraints, appending them to a cumulative dataset.
Parameters¶
model : object
    A generative model instance (e.g., CTGAN) implementing `sample_from_conditions(conditions)` to produce synthetic cadets.
total_gen : int
    Total number of cadets to generate across all AFSC-degree combinations.
afscs_rare_eligible : list of str
    List of AFSC codes considered rare and eligible for targeted generation.
afsc_cip_data : dict
    Mapping `{afsc: pandas.Series}` where the Series index is degree codes (CIP1) and values are the number of cadets to generate for each degree.
afsc_cip_conditions : dict
    Mapping `{afsc: {cip: Condition}}` specifying generation constraints for each AFSC-degree combination.
Returns¶
pandas.DataFrame
    A concatenated dataset of synthetic cadets meeting all AFSC-degree constraints.
Notes¶
- This function logs progress to the console, showing both the number and percentage of cadets generated so far.
- Sampling order is AFSC-major, iterating over all degrees within each AFSC before moving to the next AFSC.
- The `count` values in `afsc_cip_data` are expected to be integers or convertible to integers.
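The AFSC‑major loop described in the notes can be sketched as below. This is an outline under assumptions rather than the function's source: the name `sample_for_conditions` is hypothetical, and the only thing required of `model` is a `sample_from_conditions(conditions=[...])` method returning a DataFrame. With SDV, each stored condition would typically be an `sdv.sampling.Condition(num_rows=count, column_values={"CIP1": cip})`.

```python
import pandas as pd

def sample_for_conditions(model, total_gen, afscs, afsc_cip_data, afsc_cip_conditions):
    """Sample every degree bucket of each AFSC in turn, then concatenate."""
    frames, done = [], 0
    for afsc in afscs:                                   # AFSC-major order
        for cip, count in afsc_cip_data[afsc].items():   # then each degree
            count = int(count)
            if count <= 0:
                continue
            condition = afsc_cip_conditions[afsc][cip]
            frames.append(model.sample_from_conditions(conditions=[condition]))
            done += count
            pct = done / max(total_gen, 1)
            print(f"{afsc} / {cip}: {done}/{total_gen} cadets ({pct:.0%})")
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Passing one condition per call keeps the progress log aligned with AFSC‑degree buckets; batching all conditions into a single call would also work with SDV but would lose that per‑bucket reporting.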